fix(coherence): Debugging to Get Qwen Working In Agentic Coding #90

Open
tbraun96 wants to merge 50 commits into main from fix/in-think-tool-call-leak

@tbraun96
Contributor

The bug. Live-repro'd against opencode driving Qwen/Qwen3.6-35B-A3B-FP8 on Atlas with 9 tool schemas, temperature=0.3, thinking_in_tools=true. The model emits literal <tool_call><function=…><parameter=…>…</parameter></function></tool_call> XML inside its <think> block, repeated ~6 times before the thinking budget force-closes </think>. The literal XML reaches opencode's rendered transcript as reasoning_content deltas and visibly contaminates the agentic session. Reproducible across multiple sessions; see /home/nologik/opencode-session.md and /workspace/atlas-dumps/opencode.jsonl.

Root causes (3-agent Explore + 1 Plan):

  1. Thinking-loop watchdog's period range is 4..20 tokens (helpers.rs:70-79) — the leaked XML is ~50 tokens/occurrence, outside the detector.
  2. Single-token tool_call_start_token hard-mask at decode_logits_seq.rs:137-143 is a no-op when <tool_call> is multi-token.
  3. Grammar bitmask is disabled while thinking (decode_logits_seq.rs:261-273).
  4. SimHash, loop-watchdog, and the F11/F44/F5/Bug-2 caps are all gated off while thinking.
  5. Post-hoc reasoning stripper at handle_token.rs:91-112 is per-delta and can't stop the model from generating the next repetition.
  6. Sampling regime: opencode forces temp=0.3, top_p=1 (vs preset 0.6/0.95); DRY disabled; 9 tool schemas in context apply prior pressure.

MTP is not a cause (drafts are greedy argmax; F62 SSM rollback safe; leak repros pre-MTP-enable).

Fix — two layers

Layer D (hotfix, 1 line)

kernels/gb10/qwen3.6-35b-a3b/MODEL.toml: thinking_in_tools = true → false. With thinking off during tool-active turns the <think> block never opens. Re-enable once Layer A is validated end-to-end.

Layer A (real fix, ~75 LoC in 2 files)

Cross-delta scanner in the chat_stream thinking branch. On match: drop the delta, set tool_loop_capped, set stop_string_triggered, set cancel_flag — scheduler finalises within one decode step (PR #89), handle_done emits finish_reason="length" via the PR #87 override.

Files:

  • crates/spark-server/src/api/chat_stream/state.rs: reasoning_xml_scan_buf (rolling 256-char tail) + reasoning_xml_leak_detected one-shot.
  • crates/spark-server/src/api/chat_stream/handle_token.rs — scanner + short-circuit guard in the thinking phase only.

Triggers: <tool_call>, <function=, <parameter=, <invoke (Qwen3-Coder, Hermes, MiniMax-XML variants). Gated on ctx.tool_defs_for_backfill non-empty so a benign chat that mentions <tool_call> does not trip.
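
A minimal sketch of the cross-delta scan described above, assuming decoded reasoning text arrives one delta at a time. `LeakScanner` and `should_drop` are hypothetical names standing in for the two new `StreamState` fields and the guard in `handle_token.rs`; the real code also sets `tool_loop_capped`, `stop_string_triggered` and `cancel_flag`, which are omitted here.

```rust
/// Opening tags that should not appear in reasoning text while tools are active.
const LEAK_OPENERS: [&str; 4] = ["<tool_call>", "<function=", "<parameter=", "<invoke"];
/// Maximum number of characters kept in the rolling tail.
const SCAN_TAIL_CHARS: usize = 256;

/// Hypothetical stand-in for the scan buffer and the one-shot flag.
#[derive(Default)]
pub struct LeakScanner {
    scan_buf: String,
    leak_detected: bool,
}

impl LeakScanner {
    /// Feed one reasoning delta; returns true when the delta must be dropped.
    pub fn should_drop(&mut self, delta: &str, tools_active: bool) -> bool {
        if self.leak_detected {
            return true; // one-shot: everything after the first hit is dropped
        }
        if !tools_active {
            return false; // benign chats may mention the literal tags
        }
        self.scan_buf.push_str(delta);
        if LEAK_OPENERS.iter().any(|tag| self.scan_buf.contains(tag)) {
            self.leak_detected = true;
            return true;
        }
        // Keep only the last SCAN_TAIL_CHARS characters, cutting on a char boundary.
        let chars = self.scan_buf.chars().count();
        if chars > SCAN_TAIL_CHARS {
            let cut = self
                .scan_buf
                .char_indices()
                .nth(chars - SCAN_TAIL_CHARS)
                .map(|(i, _)| i)
                .unwrap_or(0);
            self.scan_buf.drain(..cut);
        }
        false
    }
}
```

Because the buffer spans deltas, an opener split across two deltas (`<tool` then `_call>`) still matches, which a per-delta check would miss.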

Layer B (deferred)

Watchdog period extension — needs codegen plumbing through 6 files for new MODEL.toml [behavior] keys. Tracked separately; Layer A alone closes the leak.

Verification

  • Local: cargo check, cargo clippy --tests, cargo fmt --check, cargo test -p spark-server (484 passed). handle_token.rs 476 LoC (≤500 cap).
  • Live replay: image rebuilt locally, deployed on dgx1, replay of jq 'select(.seq==1)' /workspace/atlas-dumps/opencode.jsonl. Expect: no <tool_call> / <function= / <parameter= substrings in reasoning_content; terminal SSE finish_reason: "length"; the new tracing::warn! line appears in server logs.

Docker Hub push held until user re-runs an opencode session and confirms the leak is gone.

Stacked on top of PR #89 (cancel_flag plumbing) — base branch is fix/scheduler-cancel-flag's landing point. Once #89 merges to main this will rebase cleanly.

tbraun96 and others added 3 commits May 22, 2026 21:12
…ps the response

When the Bug-2 name-run cap (or F11 within-dedup / F5 cross-flush dedup /
F44 perm-fail circuit-breaker) forcibly ends a streaming response,
`finish_reason` was previously `"tool_calls"` — because tool calls *were*
emitted, just truncated mid-loop. Agent clients (opencode and friends)
see a normal-looking tool-call completion, dutifully run the tools, send
the next request, and the model loops again — Atlas was breaking the
loop one round at a time without ever telling the client.

Add a `tool_loop_capped: bool` on `StreamState`, flipped true alongside
`stop_string_triggered` at every tool-call loop guard (4 sites in
`tool_handlers.rs`). `handle_done` reads it and overrides `fr` to
`"length"` — OpenAI's spec slot for "response was forcibly truncated" —
ahead of the existing `"tool_calls"` / `finish_reason` fall-throughs.
This gives every agent client a clean, spec-compliant hook to break its
outer retry loop without needing Atlas-specific headers.
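
The precedence described above can be sketched as a small pure function. This is an illustration only: `resolve_finish_reason` and its parameters are hypothetical names, not the actual `handle_done` code.

```rust
/// Pick the `finish_reason` for the terminal chunk.
/// A tripped loop guard wins over the "tool_calls" and natural fall-throughs.
pub fn resolve_finish_reason(
    tool_loop_capped: bool,
    emitted_tool_calls: bool,
    natural: &'static str,
) -> &'static str {
    if tool_loop_capped {
        "length" // forcibly truncated: the client should not retry blindly
    } else if emitted_tool_calls {
        "tool_calls"
    } else {
        natural
    }
}
```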

Also dumped to the `--dump` synthesized-response body for observability.

Verified: `cargo check`, `cargo clippy --tests`, `cargo fmt --check` all
clean. Live repro will follow once the image is rebuilt.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ppress output

The PR #87 fix changed finish_reason to "length" when a tool-loop guard
trips, so agent clients can break their outer retry loop — but only when
the scheduler actually finalises and emits Done. Live repro on opencode
revealed the deeper bug: setting `stop_string_triggered = true` in
chat_stream only suppresses *output*; the scheduler keeps generating
tokens until natural EOS or `max_tokens`. On a degenerate-loop response
(model not EOS-ing), this manifests as a hang — the stream silently
consumes tokens, the channel can fill, the scheduler can block on
`blocking_send`, GPU goes 0%, no Done event ever fires, opencode sits
forever waiting on the SSE stream.

Add a cooperative cancellation flag plumbed from chat_stream into the
scheduler:

  Arc<AtomicBool> cancel_flag
    │
    ├── created in chat_stream/mod.rs
    ├── passed into InferenceRequest::Streaming { cancel_flag, .. }
    ├── stashed on StreamState (cancel_flag) — chat_stream flips true on:
    │     • Bug-2 name-run cap trip (handle_complete_tool_call,
    │       handle_tool_call_end)
    │     • F11 within-response dedup
    │     • F44 perm-fail circuit-breaker
    │     • cross-flush tool_arg_dedup trip
    │     • loop-watchdog fire (SimHash + substring repeat)
    └── carried through PrefillInProgress → ActiveSeq on the scheduler
        side; `emit_step::emit_token` reads it at the top of every
        token-emit and sets `a.finished = true` if flipped — equivalent
        to an EOS, so the existing finalize path runs and `handle_done`
        emits the proper `tool_loop_capped` / `finish_reason="length"`
        chunks + `[DONE]`.

Spill-restored ActiveSeq carries `cancel_flag: None` — the original
streaming connection is long gone by the time a swapped-out seq resumes
from disk. /v1/completions also passes a fresh never-flipped flag so
the scheduler's type-check is satisfied; the guard pipeline doesn't run
on that legacy path yet.
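
A minimal sketch of the cooperative-cancel check, assuming a heavily trimmed `ActiveSeq`. The struct fields and the `emit_token` signature below are illustrative stand-ins, not the scheduler's real types.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical, trimmed-down per-sequence scheduler state.
pub struct ActiveSeq {
    /// `None` models a spill-restored sequence with no live stream.
    pub cancel_flag: Option<Arc<AtomicBool>>,
    pub finished: bool,
    pub output_tokens: Vec<u32>,
}

/// Check the flag at the top of every token emit; a flipped flag is
/// treated like an EOS so the normal finalize path runs.
pub fn emit_token(a: &mut ActiveSeq, tok: u32) {
    let cancelled = a
        .cancel_flag
        .as_ref()
        .is_some_and(|flag| flag.load(Ordering::Relaxed));
    if cancelled {
        a.finished = true;
        return;
    }
    a.output_tokens.push(tok);
}
```

The stream side keeps a clone of the same `Arc` and calls `store(true, ...)` when a guard trips, so the scheduler notices on its next emit without any channel traffic.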

Verified: `cargo check`, `cargo clippy --tests`, `cargo fmt --check`,
`cargo test -p spark-server` (484 passed), `cargo build --release` all
clean.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…+ opencode)

Layer-D hotfix + Layer-A primary fix (Layer-B watchdog-period extension
deferred to a follow-up — the codegen plumbing for new MODEL.toml
[behavior] keys spans 6 files and Layer-A on its own closes the leak
end-to-end).

**The bug.** With opencode (9 tool schemas, temp=0.3, 16k+ system prompt)
Qwen3.6-35B-A3B emits literal `<tool_call><function=…><parameter=…>…
</parameter></function></tool_call>` XML INSIDE its `<think>` block,
repeated 6+ times before the thinking budget force-closes `</think>`.
The leak reaches opencode's rendered transcript as `reasoning_content`
deltas and visibly contaminates the agentic session. Three-agent
investigation pinpointed six root causes; this patch closes the two
biggest gaps.

**Layer D — `kernels/gb10/qwen3.6-35b-a3b/MODEL.toml`** (hotfix):
`thinking_in_tools = true → false`. With thinking off during tool-active
turns, the `<think>` block never opens and the leak cannot occur.
Re-enable once Layer A is validated end-to-end via opencode.

**Layer A — stream-side scanner with hard cancel** (the real fix):

- `crates/spark-server/src/api/chat_stream/state.rs`: two new fields on
  StreamState — `reasoning_xml_scan_buf: String` (≤256-char rolling
  tail of decoded reasoning text across deltas) and
  `reasoning_xml_leak_detected: bool` (one-shot).
- `crates/spark-server/src/api/chat_stream/handle_token.rs` (thinking
  branch only): after the existing per-delta strippers, scan the
  cross-delta buffer for `<tool_call>` / `<function=` / `<parameter=` /
  `<invoke ` openers (covers Qwen3-Coder, Hermes, MiniMax-XML
  variants). On first match:
    1. Drop the current delta (no reasoning_chunk emit).
    2. Set `state.tool_loop_capped = true` — PR #87's override in
       `handle_done.rs:151` then maps `finish_reason` to `"length"`,
       which agent clients (opencode, Claude Code, …) treat as
       "forcibly truncated" and break their outer retry loop.
    3. Set `state.stop_string_triggered = true` for the suppression
       path.
    4. Set `state.cancel_flag` (PR #89) — the scheduler reads it at
       the top of `emit_step::emit_token` and finalises the sequence
       within one decode step.
    5. `tracing::warn!` with model + request id + last 64 chars of
       the scan buffer for postmortem.
  Subsequent thinking-phase tokens for the stream short-circuit at the
  top of the branch with empty SSE output (handles the in-flight race
  between cancel set and emit_token noticing).

  Scanner is gated on `ctx.tool_defs_for_backfill` being non-empty
  (i.e. tools_active) so a benign chat that genuinely discusses the
  `<tool_call>` literal does not trip it.

**Verified.** `cargo check`, `cargo clippy --tests`, `cargo fmt --check`,
`cargo test -p spark-server` (484 passed) all clean. `handle_token.rs`
476 LoC — under the 500 cap. Live replay verification follows image
rebuild against the running dgx1 instance.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tbraun96 requested a review from AzeezIsh as a code owner May 23, 2026 13:45
tbraun96 and others added 17 commits May 24, 2026 13:20
Checkpoint of work-in-progress before the Phase A injection-removal
refactor (see bench/qwen36_fp8_dequant_audit/prompt_injection_audit.md
and post_processor_comparison_atlas_vs_vllm.md).

Phase 2 numerical-drift fixes (Qwen3.6-35B-A3B-FP8):
- RNE rounding in f32_to_bf16 (atlas-quant/src/fp8.rs +
  spark-model/src/weight_map/fp8_lut.rs)
- __expf softmax in prefill_paged_compute.cuh (polynomial path
  retained behind ATLAS_FAST_SOFTMAX_EXP)
- FP16 P x V MMA (10-bit mantissa vs BF16's 7-bit) in
  prefill_paged_compute*.cuh
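
For reference, round-to-nearest-even (RNE) conversion from f32 to BF16 is commonly written as below. This is the generic technique, not a copy of the code in `fp8.rs` or `fp8_lut.rs`; the function name is an assumption.

```rust
/// Convert an f32 to BF16 bits with round-to-nearest-even.
pub fn f32_to_bf16_rne(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep the sign and exponent, force a quiet-NaN mantissa bit.
        return ((bits >> 16) as u16) | 0x0040;
    }
    // Add 0x7FFF plus the lowest kept bit: exact ties round to the even value.
    let lsb = (bits >> 16) & 1;
    let rounded = bits.wrapping_add(0x7FFF + lsb);
    (rounded >> 16) as u16
}
```

Plain truncation (`bits >> 16`) always rounds toward zero, which is one way a systematic drift of the kind this bullet targets can arise.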

Watchdog rewrites (scheduler/{helpers,decode_logits_*}.rs):
- vLLM-anchored content-loop detector (has_repeating_pattern_anchored)
- disable_watchdogs() env-var kill switch
- mid-word </think> defer via mid_word_token_mask
- describe_content_token_loop diagnostic re-scan on fire

Build infra (atlas-kernels/build.rs + build_target.rs):
- Dedup + std::thread::scope parallel NVCC compile; 7min -> 75sec
  (242 unique invocations vs 1169 raw, 4.8x dedup, 20 workers)

MODEL.toml sampler band-aid (all 11 GB10 model dirs):
- Empirical defaults (temp=0.6, pres=0.0, rep_pen=1.1, dry=0.5) that
  produced clean opencode sessions. Documented Qwen-card "correction"
  regression to anchor future tuning.
- max_inter_tool_prose raised 384 -> 2048 (was killing legitimate
  multi-paragraph explanations between tool calls).

Audit artifacts (bench/qwen36_fp8_dequant_audit/):
- prompt_injection_audit.md cataloging 13 always-on prompt mutations
- post_processor_comparison_atlas_vs_vllm.md identifying 5 vLLM wins
- per_model_sampler_recommendations.md research baseline

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Atlas previously ran a stateful agentic-failure-handling layer that
mutated the user-supplied message array in 13 distinct ways before
tokenization — none opt-in per request, all keyed on heuristics over
message history (audit in bench/qwen36_fp8_dequant_audit/
prompt_injection_audit.md). Phase 2 precision fixes (RNE, __expf,
FP16 P×V, vLLM-anchored loop detection) make these obsolete and they
now train the model to react to XML tags it wasn't trained on.

Deleted injection sites:
  1. Tool-parser system_prompt prepend (api/chat/mod.rs:106-118) —
     jinja template's `tools` argument is the legitimate scaffolding
  2-9. F-feature failure guards (chat_phases::apply_failure_guards
     and the entire api/failures/ subtree): F7 stall reminder, F23
     progress reminder, F29 environment_facts, F31 hard-refusal,
     F32 dup tool_result, F35 failure_recovery, F39 circuit banner,
     F49 dup-write banner, F50 original-error append
  10. Loop-detector <IMPORTANT> hint append (loop_detect.rs:131-146)
  11. task_pin verbatim-goal reminder ("The user's ORIGINAL request
     was: «...»") — task_pin.rs deleted, caller in loop_detect.rs
  12. observation_mask body rewrite (rewrote past tool/user error
     bodies in place with "[stale tool failure N/M: ...]") —
     observation_mask.rs deleted, caller in chat/mod.rs
  13. Responses-API instructions stacking (responses_lowering.rs):
     simplified to spec-faithful insert at pos 0 (no drop-of-prior)
  14. /v1/completions `<think></think>\n\n` raw-prompt prefix
     (completions.rs:74-78)

Preserved legitimate streaming guards (NOT prompt injections; operate
purely on outbound model text) by extracting them from the deleted
failures subtree into a new api/stream_guards.rs module:
  - bump_f12_tool_call_count: per-response tool-call cap
  - check_loop_watchdog: post-detector repeating-line detector
  - flush_content_sanitizer: stream-end tag-tail drain

Telemetry counters removed alongside their injectors:
  - atlas_task_pin_injections_total
  - atlas_observation_mask_elided_bodies_total

Net: -3480 LoC (40 files), zero injection sites remaining (verified
zero matches for atlas_runtime_notice / failure_recovery /
environment_facts / IMPORTANT> / stale tool failure / ORIGINAL
request was). cargo check + cargo check --tests both green.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
…ctionParams

Two small per-request controls ported from vLLM. Phase B-2
(thinking_token_budget) was already present on ChatCompletionRequest
(top-level + chat_template_kwargs paths), so only B-1 and B-3 needed
work.

B-1: stop-string hold-back buffer (vLLM IncrementalDetokenizer pattern)
- StreamCtx.stop_string_buffer_len cached once at request setup as
  max(stop_strings.len()) - 1
- StreamState.stop_string_emitted_len tracks forwarded bytes
- handle_token::apply_stop_string_holdback() helper:
  - On miss: hold back trailing buffer_len bytes (snapped to UTF-8
    char boundary via floor_char_boundary)
  - On match: truncate BOTH accumulator and emittable delta at
    match position (mirrors vLLM detokenizer.py:140-142)
- handle_done flushes any held-back tail bytes through the active
  detector/sanitizer at stream close
- 3 unit tests: chunk-boundary leak prevention, no-stop-strings
  zero behavior change, UTF-8 boundary safety
- Hand-tuned <think>-tag boundary deferral untouched
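
A minimal sketch of the hold-back idea in B-1, assuming text arrives as decoded deltas. `Holdback`, `push`, `flush` and `floor_boundary` are hypothetical names; the boundary snap is written out by hand rather than relying on a specific standard-library helper.

```rust
/// Largest index `<= idx` that lies on a UTF-8 char boundary of `s`.
fn floor_boundary(s: &str, mut idx: usize) -> usize {
    if idx >= s.len() {
        return s.len();
    }
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    idx
}

pub struct Holdback {
    acc: String,        // all text decoded so far
    emitted_len: usize, // bytes already forwarded to the client
    stops: Vec<String>,
    buffer_len: usize,  // longest stop string length minus one
}

impl Holdback {
    pub fn new(stops: &[&str]) -> Self {
        let stops: Vec<String> = stops
            .iter()
            .filter(|s| !s.is_empty())
            .map(|s| s.to_string())
            .collect();
        let buffer_len = stops.iter().map(|s| s.len()).max().unwrap_or(0).saturating_sub(1);
        Self { acc: String::new(), emitted_len: 0, stops, buffer_len }
    }

    /// Returns (text that is safe to forward, whether a stop string matched).
    pub fn push(&mut self, delta: &str) -> (String, bool) {
        self.acc.push_str(delta);
        let tail = &self.acc[self.emitted_len..];
        let hit = self.stops.iter().filter_map(|s| tail.find(s.as_str())).min();
        if let Some(pos) = hit {
            // On match: truncate the accumulator and the delta at the match.
            let cut = self.emitted_len + pos;
            self.acc.truncate(cut);
            let out = self.acc[self.emitted_len..].to_string();
            self.emitted_len = cut;
            return (out, true);
        }
        // On miss: hold back the trailing bytes that could start a stop string.
        let target = self.acc.len().saturating_sub(self.buffer_len);
        let safe_end = floor_boundary(&self.acc, target).max(self.emitted_len);
        let out = self.acc[self.emitted_len..safe_end].to_string();
        self.emitted_len = safe_end;
        (out, false)
    }

    /// Stream close: release whatever is still held back.
    pub fn flush(&mut self) -> String {
        let out = self.acc[self.emitted_len..].to_string();
        self.emitted_len = self.acc.len();
        out
    }
}
```

With no stop strings `buffer_len` is 0, so every delta is forwarded unchanged.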

B-3: per-request RepetitionDetectionParams
- New struct {min_pattern_size, max_pattern_size, min_count} on
  ChatCompletionRequest + CompletionRequest + InferenceRequest
- Threaded into ActiveSeq, PrefillInProgress, SwappedSeq (survives
  swap/restore)
- helpers.rs: new detect_*_token_loop_with(tokens, Option<override>)
  variants; legacy detect_*_token_loop() delegates to _with(.., None)
  for byte-identical fallback to watchdog_params() constants
- decode_logits_{step,content}.rs pass a.repetition_detection to the
  detectors
- Fuzzy + digit-normalized SimHash detectors left untouched
- 3 unit tests for override path (loosen / tighten / thinking-loop)

Test cleanup: detects_fence_body_with_varying_prefixes was written
for the OLD scan-anywhere detector and broke when we switched to the
vLLM-anchored one (commit 1bb82ed). Inverted the assertion and renamed
it to rejects_*: the anchored detector intentionally does not fire on
this varying-prefix pattern; the rollback-to-boundary machinery catches
the resulting tighter loop one layer up.

cargo check + cargo check --tests + clippy all green.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Live opencode session (opencode-phaseAB.jsonl, 2026-05-24) on the
post-PhaseB image surfaced two distinct user-visible failures:

1. **Content-loop watchdog firing on legitimate JSON tool-call body.**
   `CONTENT_LOOP_MIN_REPEATS=2` + `PERIOD_MIN=2` meant the watchdog
   fired on FOUR matching end-anchored tokens. JSON structural
   punctuation (`","`/`":"`) naturally forms period-2 patterns
   inside tool-call bodies — observed 18:13:18 firing at
   content_tokens=48 inside a bash() invocation, ending the response
   with `reason=NoBoundary` (rollback declined because mid-grammar).
   The tool-salvage path recovered a garbled `bash({"command":""})`.

   Fix: `CONTENT_LOOP_MIN_REPEATS` 2 → 3. Matches vLLM's
   `RepetitionDetectionParams.min_count` default. Period-2 with 3
   repeats now requires 6 matching end-anchored tokens — still
   catches genuine `[A, B]` attractors (~100ms after onset), but
   tolerates the `","`/`":"` JSON-punctuation band.

   Per-request `repetition_detection` (Phase B-3) still wins over
   this default, so clients with stricter or looser needs can
   override.

2. **2048-token thinking budget burned on agentic turns.**
   Two failures traced to `max_thinking_budget=2048` (qwen3.6 MODEL.toml):
   - seq=1 (title-gen, 0 tools): 2068 thinking tokens for a 7-token
     title.
   - seq=8 (agentic turn 9): full 2048 budget exhausted, model
     force-closed via `</think>` injection, then emitted EOS with
     zero content tokens. 33s of stall.

   Fix: `qwen3.6-35b-a3b/MODEL.toml.max_thinking_budget` 2048 → 768.
   Bounds worst-case thinking burn at ~30s. Typical agentic
   thinking is 200-500 tokens; 768 leaves headroom. Clients can
   pass `thinking_token_budget` per-request (Phase B-2) for chat
   workflows that genuinely need more.
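
The threshold arithmetic in item 1 can be illustrated with a small end-anchored detector. This is a sketch under assumptions: `has_end_anchored_repeat` is a hypothetical name and is not the repository's `has_repeating_pattern_anchored`.

```rust
/// True when the last `period * min_repeats` tokens are `min_repeats`
/// back-to-back copies of one `period`-token pattern, for some period
/// in `period_min..=period_max`.
pub fn has_end_anchored_repeat(
    tokens: &[u32],
    period_min: usize,
    period_max: usize,
    min_repeats: usize,
) -> bool {
    if period_min == 0 || min_repeats < 2 {
        return false;
    }
    for period in period_min..=period_max {
        let need = period * min_repeats;
        if need > tokens.len() {
            break; // longer periods need even more tokens
        }
        let window = &tokens[tokens.len() - need..];
        let pattern = &window[need - period..];
        if window.chunks(period).all(|chunk| chunk == pattern) {
            return true;
        }
    }
    false
}
```

With `min_repeats = 2` a tail of four tokens `[A, B, A, B]` is enough to fire, which is the JSON-punctuation false positive; with `min_repeats = 3` the same tail needs six matching tokens.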

Tests updated: `content_loop_accepts_two_repeats` → `accepts_three_repeats`
+ added negative `rejects_two_repeats`; `override_loosens_content_loop_threshold`
updated for new default (3-repeat fires, 4-repeat doesn't).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Live opencode session on `atlas-gb10:hotfix` (opencode-hotfix.jsonl
seq=10, 2026-05-24): after Atlas's tool-validation correctly rejected
a `write({})` call with empty filePath, the model degenerated into a
doom loop emitting partial `<tool_call>` envelopes that never
closed. The streaming sanitizer correctly detected the orphan opener
and flipped `suppressing_param_leak=true`, then drained the
subsequent content waiting for a close — but no close arrived for
8192 tokens (max_tokens limit). The content-loop watchdog couldn't
catch this because partial-envelope periods exceed
`CONTENT_LOOP_PERIOD_MAX=64`. User-visible result: 116s of "hung"
streaming with finish=length, content="" (sanitizer suppressed all
of it).

Fix: add `suppress_streak_tokens: u32` to `StreamState`. Per-token
on the chat_stream path, increment when `suppressing_param_leak` is
true at end-of-token; reset when it flips false. When the streak
exceeds `MAX_SUPPRESS_STREAK_TOKENS=256`, trip the same kill switch
the content-loop watchdog uses (`loop_watchdog_triggered=true`,
`stop_string_triggered=true`, `cancel_flag.store(true)`). 256 tokens
is enough headroom for legitimately long tool-call bodies (large
`content` strings on a `write` call) while bounding worst-case
wasted decode at ~10s @ 30 tok/s.

The 8192-token doom loop now ends in ~10s with finish_reason=length
and a tracing warn line carrying the streak count, so future
occurrences are self-debuggable.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Scaffold the `LogitsProcessor` trait + 8 processor implementations
from the inline pre-sample masking block in
`decode_logits_seq::process_seq_logits`. **Not yet wired** — this
commit only introduces the new module; the inline block in
`process_seq_logits` still runs the same logic. Part 2 will replace
the inline block with `run_pipeline(...)`.

Module layout (`crates/spark-server/src/scheduler/logit_processors/`):

- `mod.rs`: `LogitsContext`, `ProcessorOutcome` enum, `LogitsProcessor`
  trait with `apply(&mut [f32], &mut ActiveSeq, &LogitsContext) ->
  ProcessorOutcome`, `name()`, `is_argmax_invariant()`, and the
  `run_pipeline(...)` driver wiring the eight stages in this order:
    1. F2ConfidenceEarlyStop
    2. MidWordThinkEndMask
    3. PostCloseThinkMask
    4. ToolCallDuringThinkingMask
    5. ForcedThinkEndInjector
    6. PinToToolCallStart
    7. ForcedTokenFastPath        ← may short-circuit (EmitToken)
    8. GrammarBitmaskApply

- One file per processor (~40-80 LoC each) preserving byte-identical
  gate semantics from the monolith:
    f2_confidence.rs / mid_word.rs / post_close.rs /
    tool_during_think.rs / forced_think_end.rs / pin_tool_call.rs /
    forced_token.rs / grammar_bitmask.rs

- `pipeline_tests.rs` (cfg=test): stage-name stability + distinctness,
  `is_argmax_invariant` advertisement matrix, `should_inject_think_end`
  truth table, defer-override constants pin. Integration replay
  against the wired pipeline lives in opencode-session.md once
  part 2 lands (ActiveSeq is too complex to fixture cleanly).
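
A minimal sketch of the trait-plus-driver shape described above, with two of the eight stages. The types are simplified assumptions: the real `ActiveSeq` and `LogitsContext` carry far more state, the real trait also has `is_argmax_invariant()`, and the real `run_pipeline` fixes the stage list internally rather than taking it as an argument.

```rust
/// Simplified stand-ins for the scheduler types (assumed fields).
pub struct ActiveSeq {
    pub inside_thinking: bool,
    pub forced_token: Option<u32>,
}
pub struct LogitsContext {
    pub tool_call_start_token: u32,
}

pub enum ProcessorOutcome {
    Continue,
    /// Short-circuit: skip the remaining stages and emit this token.
    EmitToken(u32),
}

pub trait LogitsProcessor {
    fn name(&self) -> &'static str;
    fn apply(&self, logits: &mut [f32], seq: &mut ActiveSeq, ctx: &LogitsContext)
        -> ProcessorOutcome;
}

pub struct ToolCallDuringThinkingMask;
impl LogitsProcessor for ToolCallDuringThinkingMask {
    fn name(&self) -> &'static str {
        "tool_call_during_thinking_mask"
    }
    fn apply(&self, logits: &mut [f32], seq: &mut ActiveSeq, ctx: &LogitsContext)
        -> ProcessorOutcome {
        if seq.inside_thinking {
            // Hard-mask the tool-call opener while the model is thinking.
            if let Some(l) = logits.get_mut(ctx.tool_call_start_token as usize) {
                *l = f32::NEG_INFINITY;
            }
        }
        ProcessorOutcome::Continue
    }
}

pub struct ForcedTokenFastPath;
impl LogitsProcessor for ForcedTokenFastPath {
    fn name(&self) -> &'static str {
        "forced_token_fast_path"
    }
    fn apply(&self, _logits: &mut [f32], seq: &mut ActiveSeq, _ctx: &LogitsContext)
        -> ProcessorOutcome {
        match seq.forced_token.take() {
            Some(tok) => ProcessorOutcome::EmitToken(tok),
            None => ProcessorOutcome::Continue,
        }
    }
}

/// Run the stages in order; return the forced token if a stage short-circuits.
pub fn run_pipeline(
    stages: &[&dyn LogitsProcessor],
    logits: &mut [f32],
    seq: &mut ActiveSeq,
    ctx: &LogitsContext,
) -> Option<u32> {
    for stage in stages {
        if let ProcessorOutcome::EmitToken(tok) = stage.apply(logits, seq, ctx) {
            return Some(tok);
        }
    }
    None
}
```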

Out of scope (intentional): AdaptiveSamplingState entropy observation
runs after this pipeline — it's a sampling-policy decision, not a
logit transform. The final `sample_with_params_history` call is also
downstream. Both stay in `process_seq_logits` after the wire.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
… fall-through

Live opencode-hotfix2.jsonl seq=4 (2026-05-24 19:00→19:02): the
hotfix-2 streak watchdog DID NOT fire on an 8215-token doom loop
(content="", finish=tool_calls, max_tokens=8192 effectively burned).
Root cause: `handle_token` has 8+ early-return branches; the streak
check was at the natural fall-through, so most token paths skipped
it entirely.

Fix: split `handle_token` into a thin wrapper + `handle_token_inner`
body. The wrapper runs the streak update + kill check AFTER the
inner body returns, so every token path increments/resets the
counter and trips the kill switch when warranted. The body keeps the
same early-return shape; only the watchdog block moved.

The 256-token streak bound was correct; it just wasn't reachable.
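
The wrapper/inner split can be sketched as follows. The `StreamState` here is a three-field stand-in and the inner body is a hypothetical toy; the real kill switch also sets `loop_watchdog_triggered` and stores to `cancel_flag`.

```rust
const MAX_SUPPRESS_STREAK_TOKENS: u32 = 256;

/// Trimmed-down stand-in for the real stream state (assumed fields).
#[derive(Default)]
pub struct StreamState {
    pub suppressing_param_leak: bool,
    pub suppress_streak_tokens: u32,
    pub stop_string_triggered: bool,
}

/// Inner body: free to return early from any branch.
fn handle_token_inner(state: &mut StreamState, text: &str) -> Option<String> {
    if text.contains("</tool_call>") {
        state.suppressing_param_leak = false;
        return None;
    }
    if text.contains("<tool_call>") {
        state.suppressing_param_leak = true;
        return None;
    }
    if state.suppressing_param_leak {
        return None; // drained while waiting for a close tag
    }
    Some(text.to_string())
}

/// Thin wrapper: the streak update runs after the inner body on every
/// token path, so no early return can skip it.
pub fn handle_token(state: &mut StreamState, text: &str) -> Option<String> {
    let out = handle_token_inner(state, text);
    if state.suppressing_param_leak {
        state.suppress_streak_tokens += 1;
    } else {
        state.suppress_streak_tokens = 0;
    }
    if state.suppress_streak_tokens > MAX_SUPPRESS_STREAK_TOKENS {
        state.stop_string_triggered = true;
    }
    out
}
```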

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
ROOT CAUSE found 2026-05-24 (opencode-hotfix2b.jsonl seq=13, 8222
tokens, finish=length, content="", reasoning=29):

`process_decode_logits` is the non-MTP decode path. It calls
`handle_content_token` which contains the content-loop watchdog
(period-2…64 detector, MIN_REPEATS=3). All the prior watchdog tuning
applies HERE.

`verify_k2_step` / `verify_k3_step` / `mtp_step` / `spec_step` are
the MTP / speculative-decode paths. They call `emit_token` directly,
which only bookkeeps the emitted token (push to output_tokens,
advance grammar, EOS check) — it does NOT run any watchdog.

When `--speculative` is on (effectively the default for this build,
since `--speculative --mtp-quantization bf16` are in the docker
command), essentially every token goes through the MTP path. The
content-loop watchdog never fires on MTP-emitted tokens. The
recently-tuned MIN_REPEATS=3, the per-request RepetitionDetectionParams
plumbing, and the rollback machinery are all unreachable for the
common case.

Observed live: model degenerated into a period-4 `[29, 198, 510,
15704, …]` attractor (the `parameter>\n` doom-loop from the project's
prior memory note). 8193 content tokens of pure repeat, zero
watchdog fires, response burned to max_tokens=8192.

Fix: mirror the content-loop watchdog check inside `emit_token` for
the `!inside_thinking` branch. Increments `a.content_tokens`,
re-uses the same detector functions and gate constants
(`disable_watchdogs`, `enable_loop_watchdog`, `CONTENT_LOOP_*`,
`detect_content_token_loop_with`, normalized variant). On fire:
set `a.finished = true` and warn.

Skip rollback in this mirror — `emit_token` doesn't take
`&dyn Model` (the SSM rewind requires it) and plumbing it through
the 14+ MTP/spec call sites would be a separate diff. End-of-stream
is a strict improvement over the current 8192-token burn; rollback
can be added later if needed.

The non-MTP path retains rollback via `handle_content_token` unchanged.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
…aths

Closes the root-cause loop the team-debug investigation identified
(see bench/hotfix3-debug/SYNTHESIS.md): MTP K=2/K=3/K=4/dflash verify
paths were returning raw GPU argmax with NO pre-sample logit
processors applied. Every MTP-emitted token escaped grammar_bitmask,
mid_word_mask, post_close_mask, tool_during_think_mask,
forced_think_end_injector, pin_to_tool_call, forced_token_fastpath,
and f2_confidence_early_stop — causing grammar desync, malformed
tool calls (empty filePath, invented 'description' tool names),
mid-word </think> cuts, stray <think> re-entry, and the
prompt_tok>13k cliff documented in opencode-hotfix3.jsonl.

## Non-MTP path (P1a)

`decode_logits_seq::process_seq_logits`: 435 → 185 LoC. Replaced the
inline 200-line 8-stage masking block with a single
`run_pipeline(&mut f32_logits, a, &ctx)` call. `LogitsContext` carries
think_start/end and tool_call_start/end tokens; pipeline driver
short-circuits on ForcedTokenFastPath::EmitToken. Dequant block,
adaptive-sampling/sample_with_params_history, and
extract_logprobs_from_f32 unchanged.

Function-signature changes: `_think_start_token` and
`_tool_call_end_token` lost the underscore prefix — pipeline
processors consume them. Sole caller (`decode_logits_step.rs:82`)
already passes by name; no other call-site changes.

## MTP verify paths (P1b)

New helper: `crates/spark-server/src/scheduler/verify_pipeline_helper.rs`
(148 LoC):
- `verify_pick_with_pipeline(logits_bytes, ...)`: dequant + run_pipeline
  + argmax for ONE verify position (~0.4ms/position @ 256k vocab)
- `verify_pick_all_with_pipeline(model, argmax_ids, ...)`: copy
  [K, vocab] BF16 verify logits D2H, process all K positions in
  sequence; falls back to raw argmax on D2H failure

Wired into:
- `verify_k2_step` (k=2 mainline)
- `verify_k3_step` (k=3)
- `verify_k4_step` (k=4)
- `verify_dflash_step` (dflash spec-decode)
- `spec_step::step_self_spec` (self-speculation)
- `spec_step::step_ngram_verify` (ngram k=2)
- `mtp_step` (Phase A bootstrap + Phase B verify dispatch)

LogitsContext is built once per scheduler tick in `scheduler/mod.rs`
and threaded through `step_mtp`/`step_self_spec`/`step_ngram`.

Drafts are left alone — they already go through grammar bitmask
inside `run_mtp_propose_multi` per `mtp_head/forward.rs:382-463`.

GPU verify still bakes argmax inside the CUDA graph; the new path
adds a one-shot D2H copy (~K × 0.4ms) which is NOT graph-captured
(constraint from the helper design). Reuses
`model.copy_logits_to_host()` infrastructure rather than touching
`verify_b.rs` / `verify_c.rs` — additive, old argmax path remains for
non-pipeline call sites if needed.

## P2: restore `!a.inside_tool_body` gate on content-loop watchdog

Both `handle_content_token` (decode_logits_content.rs) and the
hotfix-3 MTP mirror in `emit_step.rs` now skip the watchdog inside
the tool body. xgrammar guarantees structural termination there;
repeated rejected drafts that look like a period-N attractor were
producing false-positive amputations of valid JSON. The
`parameter>\n`-style real loops are still caught the instant the
model exits the tool body.

## P3: non-silent gs.accept_token failure

`emit_step.rs::emit_token` now captures the return value of
`gs.accept_token(tok)`. On `false` (token violates grammar; xgrammar
NPDA refused), warn + `a.finished = true`. Previously a silent desync
corrupted all downstream bitmasks for the rest of the response. After
P1a+P1b this should rarely fire; it is defense-in-depth.

## P4: K2/K3/K4 ACCEPT/REJECT log fix

Replaced the `seq_len.is_multiple_of(50)` gates, which logged every
REJECT but only every 50th ACCEPT and misled the team-debug
investigation into thinking the accept rate was 0.32% when it was
actually 41.5%. The verify paths now emit a per-100-step summary
`info!` with accepts/rejects/rate via `AtomicU64` counters; per-step
events are demoted to `debug!`.
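
A minimal sketch of unbiased accept/reject counting with a periodic summary. `AcceptStats` and `record` are hypothetical names, and the sketch returns the summary string instead of calling `info!` so that it does not depend on a logging crate.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Counts every verify outcome; nothing is sampled on the way in.
#[derive(Default)]
pub struct AcceptStats {
    accepts: AtomicU64,
    rejects: AtomicU64,
}

impl AcceptStats {
    /// Record one verify step. Returns a summary line on every 100th step.
    pub fn record(&self, accepted: bool) -> Option<String> {
        let (a, r) = if accepted {
            (
                self.accepts.fetch_add(1, Ordering::Relaxed) + 1,
                self.rejects.load(Ordering::Relaxed),
            )
        } else {
            (
                self.accepts.load(Ordering::Relaxed),
                self.rejects.fetch_add(1, Ordering::Relaxed) + 1,
            )
        };
        let total = a + r;
        if total % 100 == 0 {
            let rate = a as f64 * 100.0 / total as f64;
            Some(format!("accepts={a} rejects={r} rate={rate:.1}%"))
        } else {
            None
        }
    }
}
```

Counting both outcomes at the same granularity is what keeps the reported rate honest; the old gate sampled only one side.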

## P5: flush `reasoning_tag_scan_buf` on `</think>`

`handle_token.rs` previously dropped up to ~18 trailing bytes of
every thinking block on `</think>` fire — the sanitizer's tail
buffer was never flushed. Added an explicit flush block (skipped
when reasoning-side suppression is active).

## Out of scope

Model-side FP8 KV precision drift at long context ("axut"/"withcurl"
spelling errors documented in findings-agent-C-tokenizer.md) is a
separate Phase 2c concern. The pipeline fix above does NOT address
those — they're emitted by the model, decoded correctly, rendered
correctly. See project_qwen36_phase2b_softmax_expf.md.

## Verification

- `cargo check --tests -p spark-server`: clean (0 errors, 0 warnings)
- `cargo clippy --tests -p spark-server`: clean (0 lints)
- `cargo check --workspace --tests`: clean

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Live-test on `atlas-gb10:realfix` showed every response ending at 30-60
content tokens with `gs.accept_token returned false — xgrammar NPDA
refused the emitted token` (P3 kill-switch) firing on tok=198 (\n) or
tok=29 (structural punctuation). Root cause: in
`verify_pick_all_with_pipeline`, the pipeline was applied to each of
the K verify positions independently, but the xgrammar matcher state
was NEVER ADVANCED between positions — position 1's bitmask was
computed against position 0's matcher state.

For K=2 spec decode:
  1. Position 0: bitmask correct (matches current matcher state)
  2. Position 1: bitmask STALE — matcher hasn't seen pick[0] yet
  3. Pipeline argmax for position 1 picks a token valid at position 0's
     state but possibly invalid at position 1's state
  4. emit_token(drafts[0]) → accept_token advances matcher → state now
     at position 1
  5. emit_token(verified[1]) → accept_token(pick[1]) — pick[1] is not
     in this state's bitmask → returns false → P3 ends response with
     finish_reason=length

Example: token 198 (`\n`) is legal at JSON-value-start (post-`:` in
`"key": <value>`) but illegal at JSON-comma-or-closebrace (post-value
in `"key": "value"`, where the matcher expects `,` or `}`). The bonus
token sampled at position 0's state landed on `\n`, but the real
position-1 matcher state demanded `,` or `}`.

Fix: between each verify position, speculatively `gs.accept_token(pick)`
to advance the matcher, then `gs.rollback(grammar_advances)` at end of
the helper. The real `emit_token` calls (which run after this helper
returns) re-advance the matcher normally based on which tokens the
scheduler actually accepts/rejects, so the speculative advance must be
reverted to avoid double-advancement.
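
The advance-then-rollback contract can be illustrated with a toy matcher. Everything below is an assumption for illustration: `ToyMatcher` is a two-state stand-in for the xgrammar matcher, the token ids are made up, and `verify_pick_all` is not the real helper's signature.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum State {
    ValueStart, // newline or a value is legal
    AfterValue, // only a comma is legal
}

const NEWLINE: u32 = 0;
const VALUE: u32 = 1;
const COMMA: u32 = 2;

/// Toy grammar matcher with a state history so advances can be undone.
pub struct ToyMatcher {
    history: Vec<State>, // last entry is the current state
}

impl ToyMatcher {
    pub fn new() -> Self {
        Self { history: vec![State::ValueStart] }
    }
    fn state(&self) -> State {
        *self.history.last().unwrap()
    }
    pub fn allowed(&self, tok: u32) -> bool {
        match self.state() {
            State::ValueStart => tok == NEWLINE || tok == VALUE,
            State::AfterValue => tok == COMMA,
        }
    }
    pub fn accept_token(&mut self, tok: u32) -> bool {
        if !self.allowed(tok) {
            return false;
        }
        let next = match (self.state(), tok) {
            (State::ValueStart, VALUE) => State::AfterValue,
            _ => State::ValueStart, // newline, or comma after a value
        };
        self.history.push(next);
        true
    }
    pub fn rollback(&mut self, n: usize) {
        let keep = self.history.len().saturating_sub(n).max(1);
        self.history.truncate(keep);
    }
    pub fn depth(&self) -> usize {
        self.history.len() - 1
    }
}

/// Masked argmax at each verify position. The matcher is advanced
/// speculatively between positions and rolled back before returning.
pub fn verify_pick_all(gs: &mut ToyMatcher, logits_per_pos: &[Vec<f32>]) -> Vec<u32> {
    let mut picks = Vec::new();
    let mut advances = 0;
    for logits in logits_per_pos {
        let pick = logits
            .iter()
            .enumerate()
            .filter(|(tok, _)| gs.allowed(*tok as u32))
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(tok, _)| tok as u32);
        let Some(pick) = pick else { break };
        picks.push(pick);
        if !gs.accept_token(pick) {
            break; // stop speculating; the real emit path reports desync
        }
        advances += 1;
    }
    gs.rollback(advances); // the real emit calls re-advance the matcher
    picks
}
```

In the test below the raw argmax at position 1 is the newline token, but after the speculative advance only the comma is legal, so the pick changes; the matcher ends at its starting depth.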

Defensive check on speculative `accept_token` failure: if the pipeline
ever returns a token outside the current bitmask (would indicate a
real bug elsewhere — pipeline correctly applies the mask so this
shouldn't happen except after grammar termination from a forced-token
fastpath), log a `debug!` and stop speculating. The real emit_token
will still fire P3 if there's a genuine desync — better visibility,
no silent corruption.

Updated module docstring with the new contract + a "DO NOT REMOVE"
note pinning the fix against future "simplification" reverts.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
…nsion eliminated

Ran 18 KV cache configurations against the standard 18920-token probe
on Qwen3.6-35B-A3B-FP8, comparing per-layer hidden state cosine vs HF
BF16 reference. Configs cover all 6 supported KV dtypes (Fp8, Bf16,
Nvfp4, Turbo4, Turbo3, Turbo8) × varying --kv-high-precision-layers
× varying --fp8-kv-calibration-tokens.

## Findings (full table in VERDICT-final.md)

1. **8 configs gave bit-identical 0.9605 mean cosine** — bf16-all,
   bf16-hp-max, fp8-hp{5,10,max}, nvfp4-hp{5,10,max}. The
   chunked-prefill kernel dequants K/V back to BF16 before attention,
   so the prefill cosine bench is insensitive to storage dtype. KV
   cache choice does NOT move the precision needle for this probe.

2. **3 calibration configs gave bit-identical 0.9175** —
   --fp8-kv-calibration-tokens 64, 256, 512 all produce the same
   result. Calibration is either a no-op for this probe or takes a
   different code path that degrades every setting by the same amount.

3. **Turbo8 broken on Qwen3.6 hybrid arch** — produces all-NaN hidden
   states from L19 onward (first SSM-attention-interleave failure
   beyond the 4 working full-attention layers). Validated only on
   MiniMax M2.7's pure-attention layout per kv_cache.rs:132.

4. **Per-layer diff vs May 23 rne baseline** — current fp8-baseline
   is 0.5-1% worse than the May 23 rne dump at most layers. Confirms
   the project_qwen36_phase2b_softmax_expf memory note: __expf
   unmasked a deeper precision bug the polynomial was smoothing.

## Conclusion

KV cache is exonerated. The ~4% gap to HF reference lives in
compute, specifically in the post-Phase-2b kernel surface (RNE +
__expf + FP16 P×V). Day 2 starts with bisecting which kernel change
unmasked the regression, then per-kernel cosine bisection within
each layer to localize.

Files preserved:
- bench/phase2c-kv-sweep/runner.sh — config sweep automation
- bench/phase2c-kv-sweep/VERDICT-day1.md — initial draft
- bench/phase2c-kv-sweep/VERDICT-final.md — full 18-config analysis

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Bisect infra (compile + runtime gates for future precision tests):

- ATLAS_EXTRA_NVCC_FLAGS env var threaded into
  build_target::NvidiaTarget::compile — appends arbitrary nvcc args
  (typically -D<MACRO>=1) to every kernel compile. Cargo rerun gated
  via rerun-if-env-changed=ATLAS_EXTRA_NVCC_FLAGS.

- #ifdef ATLAS_DISABLE_FP16_PV in prefill_paged_compute.cuh — both
  HDIM=256 and HDIM=512 paths. When defined, reverts smem_P/P64 to
  __nv_bfloat16, P stores to __float2bfloat16_rn, V reads to direct
  bf16 packing, and the MMA op to .bf16.bf16. Companion bisect knob
  to the existing ATLAS_FAST_SOFTMAX_EXP gate (which already flips
  sw_exp polynomial vs __expf).

- ATLAS_DISABLE_RNE=1 runtime env-var in atlas-quant::fp8::f32_to_bf16
  and spark-model::weight_map::fp8_lut::f32_to_bf16 — reverts the
  Phase 2b RNE patch to plain truncation (the pre-Phase-2b behavior).
  No rebuild needed once the gated binary is shipped.
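
For readers unfamiliar with the two rounding modes the `ATLAS_DISABLE_RNE` gate switches between, here is a small sketch of f32 to bf16 conversion by truncation versus round-to-nearest-even. It is an illustration in Python, not the Rust code in `atlas-quant`; the helper names are made up for the example, it handles finite inputs only, and reading the environment variable on every call is an assumption about the gate's shape.

```python
import os
import struct


def _f32_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _bits_f32(bits):
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def f32_to_bf16_truncate(x):
    """Drop the low 16 bits of the f32 pattern (pre-Phase-2b behavior)."""
    return _bits_f32(_f32_bits(x) & 0xFFFF0000)


def f32_to_bf16_rne(x):
    """Round to nearest, ties to even, on the low 16 bits (finite inputs only)."""
    bits = _f32_bits(x)
    bits += 0x7FFF + ((bits >> 16) & 1)
    return _bits_f32(bits & 0xFFFF0000)


def f32_to_bf16(x):
    """Pick the rounding mode from the environment, as a runtime bisect knob."""
    if os.environ.get("ATLAS_DISABLE_RNE") == "1":
        return f32_to_bf16_truncate(x)
    return f32_to_bf16_rne(x)
```

For example, `1.005859375` (1 + 2^-8 + 2^-9) truncates to `1.0` but rounds to `1.0078125` under RNE, which is the kind of per-element bias the bisect toggles on and off.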

Day 2 bisect results (full table in VERDICT-day2.md):

| Config | mean | min | Δ vs baseline |
|---|---|---|---|
| fp8-baseline (current) | 0.9615 | 0.9179 | — |
| -DATLAS_FAST_SOFTMAX_EXP=1 | 0.9617 | 0.9184 | +0.0002 |
| -DATLAS_DISABLE_FP16_PV=1 | 0.9619 | 0.9172 | +0.0004 |
| ATLAS_DISABLE_RNE=1 | 0.9569 | 0.8990 | -0.0046 |
| rne (May 23 reference) | 0.9668 | 0.9290 | +0.0053 |

None of the three Phase 2b kernel changes individually accounts for
the 0.5% cosine gap vs the May 23 rne reference. RNE is actively
helping (disabling it regresses by 0.5%). Both __expf and FP16 P×V
are effectively neutral.

The rne reference dump's filesystem timestamp predates any docker
image currently on disk by 17+ hours, so its lineage is unclear —
likely a pre-Phase-2 baseline whose 0.5% gap to current realfix2
reflects accumulated infrastructure changes outside the Phase 2b
kernel surface. Direct precision comparison is partly invalid.

Day 3 attack vectors: per-sub-step cosine within each layer, MoE
expert-routing divergence audit, NVFP4 weight checkpoint comparison.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Days 1+2 falsified the FP8 KV cache hypothesis. Day 3 finds the
actual answer: the model degeneration is in the FP8 WEIGHT
quantization, not Atlas's compute or KV cache. Same Atlas
(atlas-gb10:realfix2), same probe, swap the weight checkpoint
from `Qwen/Qwen3.6-35B-A3B-FP8` to
`RedHatAI/Qwen3.6-35B-A3B-NVFP4`:

| Aspect | FP8 | NVFP4 |
|---|---|---|
| Throughput (short) | ~65 tok/s | 117 tok/s (1.8×) |
| TTFT | ~140ms | 72ms (½) |
| TOML quotes | single (invalid) | double (valid) |
| Token glue artifacts | "withcurl"/"axut" | none |
| Language switching | Indonesian descriptions | none |
| Code coherence | garbled | valid Rust+axum |
| Deep-context (9780 prompt) tool call | empty filePath, garbage args | "rm -rf … && mkdir -p …" valid shell |
| Watchdog amputations | frequent | none |
| Long-response finish | doom-loop "length" | natural "stop"/"tool_calls" |

Atlas's compute pipeline is correct. NVFP4's per-block FP8 scales ×
4-bit E2M1 nibbles give 16× finer per-element precision than dense
FP8 E4M3, holding late-layer K/V and MoE weights within reasonable
bounds where FP8 hits its E4M3 dynamic-range cliff and quantizes
coarsely. Project memory's note "NVFP4 KV best at deep, FP8 best
early" was correct — applied to WEIGHTS, not KV storage.

Verdict: production deployment of Qwen3.6-35B-A3B should use the
NVFP4 checkpoint, not FP8. atlas-gb10:realfix2 works directly with
RedHatAI/Qwen3.6-35B-A3B-NVFP4 — no rebuild needed.

Recommended invocation in bench/phase2c-kv-sweep/VERDICT-day3.md.

Also adds: bench/phase2c-kv-sweep/compare-configs.py — multi-config
cosine comparator used throughout the bisects.
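
As a rough idea of what a per-layer cosine comparison computes, here is a dependency-free sketch. It is not the contents of `compare-configs.py`; the function names and the NaN handling (NaN layers are reported but excluded from mean and min) are assumptions made for the illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; NaN if undefined."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0 or math.isnan(dot) or math.isnan(norm):
        return float("nan")
    return dot / norm


def summarize(reference, candidate):
    """Return (per-layer cosines, mean, min) over the layers that are finite."""
    per_layer = [cosine(r, c) for r, c in zip(reference, candidate)]
    finite = [c for c in per_layer if not math.isnan(c)]
    if not finite:
        return per_layer, float("nan"), float("nan")
    return per_layer, sum(finite) / len(finite), min(finite)
```

A config that produces NaN hidden states from some layer onward (as Turbo8 did in the Day 1 sweep) would show NaN entries in `per_layer` while the mean and min still describe the layers that remained finite.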

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Live NVFP4 testing (opencode-nvfp4.jsonl seq=15) hit a doom-loop
distinct from the hotfix-2b-handled cases: model emitted a
`<tool_call>` opener but never reached a matching `</tool_call>`
close. 8221 tokens of buffered tool-call args, all suppressed by the
sanitizer's orphan detection at the final flush, finish=length,
content="". User saw the request as "trapped" while it burned to
max_tokens=8192.

The hotfix-2b suppression-streak watchdog couldn't catch this
because `suppressing_param_leak` only flipped at the FINAL flush
(orphan detection ran at stream end, not during emission) — so the
streak counter never accumulated.

Fix: new ActiveSeq.tool_body_streak_tokens counter incremented in
emit_step::emit_token while a.inside_tool_body=true. Resets on
`<tool_call>` open or `</tool_call>` close. When the streak exceeds
MAX_TOOL_BODY_TOKENS=1024, set a.finished=true.

1024 is enough headroom for legitimate long tool-call bodies
(large `content` field on a `write` call) while bounding worst-case
wasted decode at ~16s @ 65 tok/s (1024 / 65 ≈ 15.8), vs the ~95s burn
at max_tokens=8192 we observed.
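
A minimal sketch of the streak counter, assuming tokens arrive as already-decoded strings and that "exceeds" means strictly greater than the cap. The field names mirror the ones named above; everything else (the dataclass, matching on token text rather than token IDs) is a simplification for illustration and not the `emit_step` code.

```python
from dataclasses import dataclass

MAX_TOOL_BODY_TOKENS = 1024


@dataclass
class ActiveSeq:
    inside_tool_body: bool = False
    tool_body_streak_tokens: int = 0
    finished: bool = False


def emit_token(a, text):
    """Track how long the sequence has been inside an unclosed tool body."""
    if text == "<tool_call>":
        a.inside_tool_body = True
        a.tool_body_streak_tokens = 0
    elif text == "</tool_call>":
        a.inside_tool_body = False
        a.tool_body_streak_tokens = 0
    elif a.inside_tool_body:
        a.tool_body_streak_tokens += 1
        if a.tool_body_streak_tokens > MAX_TOOL_BODY_TOKENS:
            a.finished = True  # stop decoding instead of burning to max_tokens


def worst_case_seconds(tokens, tokens_per_second):
    """Upper bound on decode time spent before the cap fires."""
    return tokens / tokens_per_second
```

At the 65 tok/s figure quoted above, `worst_case_seconds(1024, 65)` is a little under 16 seconds.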

Companion to:
- hotfix-2b (suppress_streak_tokens) — catches orphan opener
  detected DURING the stream
- content-loop watchdog (decode_logits_content + emit_step) —
  catches period-N attractors OUTSIDE tool body
- max_inter_tool_prose (MODEL.toml) — caps free prose BETWEEN tool
  calls

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
User-prompted forensic audit of "every logic gate that picks the WRONG
quantization for FP8 weights when canonical FP8 paths exist."

Key finding: the initial assumption ("all FP8 weights re-quantized to
NVFP4") was wrong. Atlas's full-attention QKVO and routed MoE experts
DO run on canonical FP8 kernels for Qwen3.6-35B-A3B-FP8 (via
`set_fp8_experts` / `set_fp8_weights` in qwen35 loader). The
`quantize_to_nvfp4` boot-log lines come from a SMALLER set of weights.

5 bugs identified, ranked by severity:

1. HIGH — SSM decode is always NVFP4
   File: weight_loader/qwen35/load_layers/linear_attn_arms.rs:176-190
   30 of 40 layers (linear attention) decode through NVFP4-from-FP8
   weights. The FP8-native arm `build_linear_attention_fp8` exists
   in the same file (line 24) but is "permanently short-circuited"
   per the comment at load_layers.rs:334-342. The kernels and
   set_fp8_weights plumbing for Qwen3SsmLayer all exist.

   Likely the actual source of the "deep-layer drift" memory entries
   (project_qwen36_phase2b_softmax_expf, project_qwen36_drift_gdn_clean).
   Drift accumulates through the SSM stack and corrupts late
   full-attention layers L31-L39.

   Fix: route variant==Fp8Dequanted to build_linear_attention_fp8;
   remove parallel _nvfp4 build. Medium difficulty.

2. MED — Dead NVFP4 shared expert built but never consumed
   (ssm_qwen35.rs:184). Memory waste, no correctness impact.

3. MED — MoE router gate loaded via `dense` (raw bytes, no dtype
   conversion) then quantize_to_nvfp4'd. If gate is FP8 in
   checkpoint, the bytes are read as BF16. Fix: dense → dense_auto.

4. LOW — LM head same pattern. Currently dormant (Qwen3.6 FP8
   leaves lm_head BF16) but latent for any future FP8 checkpoint
   that quantizes the head.

5. LOW — MTP head double-quantizes via quantize_to_nvfp4 for all
   projections under MtpQuantization::Nvfp4. Default should be
   Fp8 when native_fp8.

Recommended single fix: Bug #1 (re-enable FP8-native SSM decode).
The required kernels and plumbing exist; only the dispatch needs
re-wiring.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Tried to re-wire LinearAttention to use `build_linear_attention_fp8`
for `Nvfp4Variant::Fp8Dequanted`. Build succeeded, container crashed
at layer 0 load with `cuMemcpyDtoDAsync_v2 status 1` (INVALID_VALUE).

Root cause (deeper than what the CAUSAL-PATHWAY-AUDIT spotted): the
`Fp8Weight` struct's `row_scale` field documents itself as
"`[N]` f32 per-row dequant scale" (32 KB for N=8192). But
`load_fp8_block_scaled_as_fp8weight` populates that same field with
per-BLOCK BF16 scales of shape `[N/BS, K/BS]` (= 2 KB for
N=8192, K=2048, BS=128 → 64*16*2 bytes).

`build_linear_attention_fp8` then concatenates QKV+Z scales via:
  gpu.copy_d2d(qkv_fp8.row_scale, qkvz_scale_ptr, qkv_rows * 4)
which asks CUDA to copy 32 KB from a 2 KB source. CUDA correctly
returns INVALID_VALUE.
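
The size mismatch is easy to reproduce on paper. The sketch below only redoes the arithmetic from the numbers quoted above (N=8192, K=2048, BS=128); the function names and the `checked_copy_len` guard are hypothetical and not part of Atlas.

```python
def per_row_f32_scale_bytes(n):
    """Size the field's doc comment promises: one f32 per output row."""
    return n * 4


def per_block_bf16_scale_bytes(n, k, bs=128):
    """Size the loader actually allocates: one bf16 per (N-block, K-block)."""
    return (n // bs) * (k // bs) * 2


def checked_copy_len(src_bytes, requested_bytes):
    """Hypothetical guard: refuse a copy that would read past the source."""
    if requested_bytes > src_bytes:
        raise ValueError(
            f"copy of {requested_bytes} bytes from a {src_bytes}-byte buffer"
        )
    return requested_bytes
```

With the shapes above, the first function gives 32768 bytes and the second 2048 bytes, so a copy sized as `qkv_rows * 4` over-reads the block-scale buffer by a factor of 16.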

The downstream `fp8_gemm_n128` kernel that decode_batched calls also
doesn't take any scale argument — it expects single-scale FP8
(produced by `bf16_to_fp8` from a BF16 dense), not the block-scaled
FP8 the checkpoint ships. So even if the concat were fixed, the math
would be wrong (FP8 values un-rescaled by their per-block
multipliers).

Reverting the dispatch back to NVFP4-only for LinearAttention.
Updated comments at both sites to document why the FP8 arm stays
dead-coded and what kernel work would be required to revive it.

The CAUSAL-PATHWAY-AUDIT.md Bug #1 finding stands but is *open* — the
right fix is to rewrite the FP8 SSM GEMV/GEMM chain to consume the
per-block scales the checkpoint provides, not to flip a dispatch. In
the meantime, the production workaround is to serve the NVFP4
checkpoint (`RedHatAI/Qwen3.6-35B-A3B-NVFP4`).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
…parser

Multi-iteration work toward clean opencode multi-turn agentic sessions on
Qwen3.6-35B-A3B-FP8. The combined image (atlas-gb10:fp8-much-better) ships:

- Native FP8 SSM dispatch (block-scaled `w8a16_gemv` decode + single-scale
  `fp8_gemm_n128` prefill) — no NVFP4 detour. WeightQuantFormat enum tags
  Fp8Weight buffers with `Fp8BlockScaled`/`Fp8PerRow`/`Fp8SingleScale` so
  kernel-call sites can assert format and fail-fast on silent quant leaks.
  Concat math for QKVZ along the N-block axis writes
  `(qkv_rows/BS)*(K/BS)*2` bytes of BF16 scales per row, then
  `(z_rows/BS)*(K/BS)*2` — both Nq and Nz are required to align to BS=128
  by the on-disk Qwen FP8 format. Out-proj is loaded block-scaled directly.

- Streaming detokenizer fix: content phase now uses the full-decode +
  slice approach (matching the reasoning path). HF tokenizers' DecodeStream
  drops the leading metaspace byte at certain BPE-token boundaries for
  byte-level tokenizers like Qwen's, producing `name = test-rust-axum-v32version`
  in streamed `<parameter=content>` while non-streaming
  `tokenizer.decode(&all_toks)` yields the correct multi-line TOML.
  `tokenizer.decode(&state.all_toks)` + `full[state.emitted..stable_end]`
  is byte-exact vs non-streaming for both reasoning and content (verified
  live with temp=0 seed=42).

- MoE gate stays BF16 for native_fp8: the `[num_experts, h]` BF16 gate is
  in the FP8 release's `ignored_layers`; runtime-quantizing it to NVFP4
  (4-bit) destroys the precision the router needs at late layers where
  the top-8 weights cluster in `[0.105, 0.168]`. Non-FP8 variants keep
  the runtime NVFP4 quantize.

- Whitespace-only stream chunks are no longer dropped — `cleaned.is_empty()`
  instead of `cleaned.trim().is_empty()`. The `sanitize_content_chunk`
  holdback rollout commits leading `\n   ` indent as a chunk; previously
  the trim filter dropped that chunk and `state.emitted` advanced past
  the unemitted bytes, permanently losing indent.

- PR 73's qwen3_xml parser + schema-driven type coercion cherry-picked
  (qwen3_xml.rs, type_coerce.rs, group_e.rs tests, registration in
  tool_parser.rs). Live opencode v40 showed qwen3_xml itself confused
  the model under thinking_in_tools=true + long opencode prompt, so
  MODEL.toml defaults back to qwen3_coder (auto-detect from
  model_type) — but qwen3_xml stays in the binary as a selectable
  parser for downstream consumers who want the type-coercion behavior.

- post-think EOS guard scoped to `require_tool_call`. Was unconditional
  POST_THINK_MIN_CONTENT=16 — forced short answers like "4" or "Paris"
  to keep generating into chat-template artefacts (`\nuser\nassistant`).
  Now only suppresses EOS when tools are required, so plain-text answers
  let `<|im_end|>` fire as expected.

- MODEL.toml `thinking_in_tools = true` (matches user's "MUCH better"
  recall); `tool_call_parser` default left to auto-detection so
  qwen3_coder fires under `model_type = qwen3_6_moe`.
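
The full-decode + slice approach and the whitespace-chunk rule from the list above can be sketched together. This is a toy model: `decode` is a hypothetical stand-in for `tokenizer.decode`, offsets are Python character offsets rather than Rust byte offsets, and holding back a trailing U+FFFD is an assumed meaning of `stable_end`.

```python
class StreamState:
    def __init__(self):
        self.all_toks = []  # every token seen so far in this phase
        self.emitted = 0    # number of characters already sent to the client


def decode(tokens, vocab):
    """Toy decoder: a real tokenizer would merge bytes across tokens here."""
    return "".join(vocab[t] for t in tokens)


def push_token(state, tok, vocab):
    """Re-decode everything, then emit only the not-yet-sent stable suffix."""
    state.all_toks.append(tok)
    full = decode(state.all_toks, vocab)
    stable_end = len(full)
    # Assumed holdback: do not emit a trailing replacement character, since
    # it may become a valid character once the next token arrives.
    while stable_end > state.emitted and full[stable_end - 1] == "\ufffd":
        stable_end -= 1
    chunk = full[state.emitted:stable_end]
    if chunk == "":  # empty check only; whitespace-only chunks are kept
        return None
    state.emitted = stable_end
    return chunk
```

Because every chunk is a slice of the same full decode, concatenating the emitted chunks reproduces the non-streaming string, and an indent-only chunk is sent rather than dropped.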

All WeightQuantFormat constructors tagged. cargo check / cargo clippy
green workspace-wide.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>


def fmt(x: float) -> str:
if x != x: # NaN
for h in headers[1:]:
deltas = [data[h][i] - base[i] for i in range(40)]
mean_d = float(np.nanmean(deltas))
max_d = max(deltas, key=lambda x: abs(x) if x == x else 0)
Comment thread bench/nemotron_hf_ref.py Fixed
TheTom added a commit to TheTom/atlas that referenced this pull request May 25, 2026
Applies the FP16 P×V MMA upgrade + __expf softmax replacement from
Avarok-Cybersecurity#90 (fix/in-think-tool-call-leak) to the asym
prefill kernel that this PR introduced, and pulls the same fix into
the upstream symmetric kernels we carried forward unchanged.

Motivation (per @tbraun96's PR Avarok-Cybersecurity#92 review + Discord context): the
prior `sw_exp` polynomial advertised ~1e-4 max relative error, but
measures ~5.6e-3 (~0.5%) against `torch.exp`. Across 18920-token
attention rows × 10 full-attention layers, that compounds to
measurable per-layer cosine drift vs HF reference. The FP16 P×V MMA
upgrade trades ~10% prefill slowdown for ~8× higher mantissa
precision on the softmax probabilities (P), which is the dominant
remaining attention-output drift source on Qwen3.6-35B-A3B-FP8.

Files:
- kernels/gb10/common/prefill_paged_compute.cuh — direct
  cherry-pick of the Phase 2c kernel changes + ATLAS_DISABLE_FP16_PV
  bisect toggle (matches PR Avarok-Cybersecurity#90 byte-for-byte in the fix regions).
- kernels/gb10/common/prefill_paged_compute_512.cuh — same sw_exp
  refactor for the HDIM=512 path.
- kernels/gb10/common/prefill_paged_compute_asym.cuh — TQ+ asym
  fork carries the same precision bug; applied the equivalent fix
  (helper + __half smem_P/P64 + __float2half_rn + .f16.f16 MMA).
  Skipped the ATLAS_DISABLE_FP16_PV debug toggle for now (can add
  later if the team wants to bisect asym paths separately).

Q×K stays BF16 (range matters there); P×V becomes FP16 (precision
matters, range is bounded [0,1] post-softmax). All bf16 stores now
use __float2bfloat16_rn for RNE rounding.
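
To make the range-versus-precision trade-off concrete, here is a small sketch that rounds a softmax probability to bf16 (7 explicit mantissa bits) and to fp16 (10 explicit mantissa bits) and compares the error. It runs on the CPU with Python's `struct` module and is not kernel code; the helper names are made up for the example.

```python
import struct


def round_bf16(x):
    """Round a finite value to bf16 (nearest-even) via its f32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def round_fp16(x):
    """Round a value to IEEE half precision using struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def rounding_errors(p):
    """Absolute rounding error of p in (bf16, fp16)."""
    return abs(round_bf16(p) - p), abs(round_fp16(p) - p)
```

For `p = 0.3`, bf16 lands on `0.30078125` and fp16 on `0.300048828125`. The three extra mantissa bits give a spacing ratio of 2^3 = 8 between the two formats, which matches the ~8× figure quoted above.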

Verified locally: nvcc 13.0 compiles all 114 kernels clean on sm_120
including the TQ+ asym variants (bf16k_turbo3v, fp8k_turbo3v, etc).
`compile_qwen3_coder_tool_grammar` was forcing the tool-call body to
`json_schema` content type, but the qwen3_coder wire format the model
emits — and the parser at `parse_single_b.rs:60-132` expects — is
native XML (`<parameter=KEY>VALUE</parameter>`), not JSON. When the
FSM was in a JSON-string state and the model wanted to emit a single
BPE token that decoded to bytes containing `<parameter=` (or any
non-JSON-string-class character), xgrammar masked the entire token
and forced a near-substitute. The result, live in opencode multi-turn
agentic sessions on 2026-05-25:

  /tmp/test-rust-axum-v42 → /tmp/test-rust-axu-v4      (drops `m`, drops `2`)
  /tmp/test-rust-axum-v42 → /tmp/test-rust-axu-m-v4    (drops `2`, inserts `-`)
  /tmp/test-rust-axum-v42 → /tmp/test-rust-axum-v01    (4→0, 2→1)
  "filePath":"]}]}]}}}"                                 (JSON-delimiter cascade)

Switching body to `any_text` keeps the OUTER `<tool_call>\n<function=…>\n
…\n</function>\n</tool_call>` framing constrained by xgrammar (begin/end
fixed) while leaving the body bytes unconstrained, so native XML
`<parameter=>` blocks pass through intact. Schema validation remains:
`validate_single_tool_call` + `backfill_required_params` run host-side
after `parse_one_call` (`tool_handlers.rs:46`), catching any actual
schema violations after the fact. Mirrors MiniMax's grammar pattern at
`compile_minimax_xml_tool_grammar` line 472 which has used `any_text`
for the same native-XML reason.
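
A rough sketch of what host-side parsing and validation of an unconstrained body looks like. The function names echo `parse_one_call` and `validate_single_tool_call` from the text, but the bodies here are hypothetical regex-based stand-ins, the schema shape (`{"required": [...]}`) is an assumption, and trimming one newline on each side of a value is assumed rather than taken from the parser.

```python
import re

FUNC_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(
    r"<parameter=([A-Za-z_][A-Za-z_0-9]*)>(.*?)</parameter>", re.DOTALL
)


def parse_one_call(inner):
    """Extract (tool name, args) from the text between the tool_call tags."""
    m = FUNC_RE.search(inner)
    if m is None:
        return None
    args = {}
    for key, value in PARAM_RE.findall(m.group(2)):
        if value.startswith("\n"):
            value = value[1:]
        if value.endswith("\n"):
            value = value[:-1]
        args[key] = value
    return m.group(1), args


def validate_single_tool_call(name, args, schemas):
    """Return None when the call is acceptable, else an error message."""
    schema = schemas.get(name)
    if schema is None:
        return f"unknown tool {name!r}"
    missing = [k for k in schema["required"] if k not in args]
    if missing:
        return "missing required parameters: " + ", ".join(missing)
    return None
```

The point of the design is that the path bytes are never filtered through a JSON-string state machine: whatever the model emits between the parameter tags reaches the parser unchanged, and schema problems are reported after the fact.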

Live verification (atlas-gb10:fp8-grammar-any-text, opencode axum v43):

  - v41 (nvfp4-KV + hp5):    0 files persisted (path corrupted)
  - v42 (fp8-KV uniform):    0 files persisted (path corrupted)
  - v43 (this commit):       20 files persisted, byte-exact
                             `/tmp/test-rust-axum-v43/Cargo.toml`,
                             full cargo new skeleton on disk.

Investigation thread that landed here:

  - 40-layer cosine study on dgx2 ruled out model precision drift
    (Atlas[FP8-native] vs HF[FP8→BF16] mean cos 0.99497, min 0.99012).
  - KV format swap (v41 nvfp4-hp5 vs v42 fp8-uniform) ruled out KV.
  - Six parallel Explore agents on dgx1 ruled out:
    * HF DecodeStream (audit found no live call sites; 8d2cc87 was clean).
    * Sampler penalties in tool body (`inside_tool_body` flips correctly
      on token 248058; rep_pen/DRY/freq/presence zeroed in body).
    * Stop-string holdback (char-boundary safe, flushed on done).
    * Mistral whitespace-skip char-boundary issue (latent, separate bug).
    * Sanitizer `state.emitted` cursor desync (real content-phase bug
      to fix separately; not the tool-arg path).
  - Three web-research agents pinpointed vLLM PR #35615 (Qwen3Coder
    spec-decode parameter loss) but reading Atlas's `streaming_impl.rs`
    showed Atlas already buffers until `</tool_call>` and parses the
    full inner via `parse_one_call`, so the close-before-drain pattern
    is structurally absent here. The remaining hypothesis — grammar
    bitmask over-constraining valid BPE tokens at the JSON-schema
    body — survived; this commit confirms it.

Tests updated (`grammar/tests/qwen3_coder_required.rs`): the previous
suite pinned "grammar REJECTS empty body" — a property of the prior
`json_schema` body type. With `any_text`, required-param enforcement
moves host-side to `validate_single_tool_call`. New tests pin the
envelope shape and accept BOTH native XML and legacy JSON bodies
(the parser supports both via its JSON fallback at parse_single_b.rs:137).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tbraun96 and others added 5 commits May 25, 2026 16:05
Mirrors F78 (Write family non-empty path) for shell-execution tools.
Without this, the qwen3_coder `any_text` body grammar (commit e99159d)
accepts an immediately-closed parameter `<parameter=command></parameter>`,
backfill_required_params then inserts `""`, validator passes, opencode
returns "The argument 'file' cannot be empty. Received ''", and the
model burns its turn budget retrying the same empty call.

Live opencode v44 hit exactly this pattern: cargo init succeeded twice
clean, then 2× empty bash, then the model "started fresh" on a wrong
version number and burned to max_tokens.

The previous `json_schema` body grammar paired with
`enforce_min_length_on_required_strings` (`grammar/schema.rs`) enforced
min_length 1 at the FSM level. Lifting that check to the validator
post-parse keeps the same invariant while letting the grammar body be
`any_text` (the native qwen3_coder XML wire format the parser at
`tool_parser/parse_single_b.rs:60-132` actually expects).

Verification on v45 (atlas-gb10:fp8-grammar-shellfix):
  - cargo init succeeded byte-exact `/tmp/test-rust-axum-v45`.
  - read Cargo.toml clean.
  - write Cargo.toml with `axum = { version = "= 0.8", features=["json"] }`.
  - One content-loop attractor (`version = { version =` period-2 repeat)
    correctly caught by the existing watchdog; model recovered next turn.
  - 0 empty-bash retries — the new validator stops them at the gate.

Out of scope (deferred):
  - The Write tool's `content` parameter is still accepted as empty —
    F78 only carves out PATH_KEYS. Adding `content` would interact with
    legitimate "create empty file" intents, so deferred until live
    evidence shows it's a regression risk.
  - The MoE FP8 dequant drift identified by the dgx2 per-op cosine study
    (`bench/fp8_dgx2_drift/MASTER_DRIFT_TABLE.md`, worst op
    ssm.moe_out @ L20 = 0.91983) — multi-day kernel-level investigation.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Empty-required-string validation rejections (F78 path empty, shellfix
command empty) used to emit a `[atlas] Tool call rejected: …` content
chunk and trip `stop_string_triggered`. But `handle_tool_call_start`
had already emitted the `tool_calls[idx]` header to opencode, so
suppressing the args delta left opencode mid-call with no completion.
opencode then reported the cryptic `SchemaError(Missing key)` instead
of its own per-tool error message ("The argument 'file' cannot be
empty. Received ''"), starving the model of actionable feedback.

Soft errors (any validation message containing "non-empty") are
recoverable — emit the args delta as the model produced them, log a
warn, and let opencode's per-tool schema surface its own error.
Hard errors (unknown tool name, args not valid JSON, malformed path
that looks like a directory) still bail with a content chunk because
they cannot be made into a complete tool call at all.

Same treatment applied to `handle_complete_tool_call` for the
non-streaming complete-call path.
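
The soft/hard split can be sketched as a small decision function. This is illustrative only: the event tuples and function names are hypothetical, and the real handlers stream SSE deltas rather than returning lists. The "non-empty" substring test is the rule stated above.

```python
def classify_rejection(message):
    """Soft errors are the ones the client can report better itself."""
    return "soft" if "non-empty" in message else "hard"


def handle_validation(message, args_json):
    """Decide what to send after the tool_calls header has already gone out.

    `message` is None when validation passed. Returns a list of
    (event kind, payload) tuples in emission order.
    """
    if message is None:
        return [("args_delta", args_json)]
    if classify_rejection(message) == "soft":
        # Complete the call as produced so the client surfaces its own
        # per-tool error; keep a server-side warning for diagnostics.
        return [("warn", message), ("args_delta", args_json)]
    # Hard errors cannot be turned into a complete call: bail out with a
    # content chunk (the exact wording of that chunk is not modeled here).
    return [("content", message)]
```

The key property is that a soft rejection still ends in an `args_delta`, so the client is never left with an opened tool call and no arguments.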

Verification (atlas-gb10:fp8-grammar-softrej, opencode v47):
  - cargo init succeeded byte-exact `/tmp/test-rust-axum-v47`.
  - Subsequent empty-bash retries surfaced opencode's clean error
    "The argument 'file' cannot be empty. Received ''" rather than
    SchemaError(Missing key).
  - Model still drifts into empty-parameter bodies under MoE FP8
    dequant precision drift (see `bench/fp8_dgx2_drift/
    MASTER_DRIFT_TABLE.md`, ssm.moe_out @ L20 = 0.91983) — recovery
    via the cleaner error message alone is insufficient. Sampler-
    level enforcement (mask `</parameter>` close when param body has
    emitted zero non-whitespace chars) is the next intervention.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Atlas's MoE FP8 grouped GEMM was applying the BF16 block-scale to each
FP8-dequant element BEFORE storing it in SMEM as BF16:

    float scale = __bfloat162float(S_exp[n_block * k_blocks + k_block]);
    smem_B[k][n] = __float2bfloat16(E4M3_LUT[byte] * scale);

The `__float2bfloat16(LUT * scale)` truncates each scaled weight to BF16's
7-bit mantissa BEFORE the BF16×BF16 MMA, so the scale's precision interacts
with EVERY single multiply-accumulate. Per the dgx2 per-op drift study at
`bench/fp8_dgx2_drift/MASTER_DRIFT_TABLE.md`, this manifested as the worst
single Atlas op cosine vs HF[BF16-unquant]: ssm.moe_out @ L20 = 0.91983
(8% deviation; unacceptable).

The fix is DeepGEMM's two-level accumulation:

  1. SMEM stores `__float2bfloat16(E4M3_LUT[byte])` only — lossless because
     FP8 E4M3 has 3-bit mantissa and BF16 has 7-bit mantissa.
  2. Inner FP32 accumulator collects BF16×BF16 MMA products for one K=128
     scale-block (8 K_STEP=16 iterations).
  3. At the K-block boundary, multiply the inner accumulator by the
     block-scale (one BF16→FP32 conversion per K-block per CTA) and add
     to the outer FP32 accumulator.
  4. Reset inner accumulator; advance to next K-block.
  5. Final BF16 cast in the epilogue from outer_acc.

Net effect: the scale's BF16 truncation now happens ONCE per K=128 partial
sum, instead of being baked into every dequanted weight. The FP8 weight
precision is preserved through the inner MMA reduction.

Per-CTA observation that simplifies the patch: N_TILE=64 < FP8_BLOCK=128
and `cta_n` is always a multiple of N_TILE, so all 64 N-cols of a CTA
fall within a single N-scale-block. One scale lookup per K-block per CTA,
applied uniformly to all 8 n_tile sub-tiles. No per-tile dispatch needed.
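
The difference between the two schemes shows up even on a four-element dot product. The sketch below emulates bf16 rounding on the CPU and compares "scale then round each weight" against "accumulate unscaled, scale once per block". It is a numerical illustration only: there is no MMA, tiling or FP8 decoding here, and the function names are invented for the example.

```python
import struct


def bf16(x):
    """Round a finite value to bf16 (nearest-even) via its f32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot_scaled_per_element(w_blocks, x_blocks, scales):
    """Old scheme: every weight is scaled, then rounded to bf16."""
    acc = 0.0
    for w, x, s in zip(w_blocks, x_blocks, scales):
        for wi, xi in zip(w, x):
            acc += bf16(wi * s) * xi
    return acc


def dot_two_level(w_blocks, x_blocks, scales):
    """New scheme: unscaled inner sum per block, one scale multiply per block."""
    outer = 0.0
    for w, x, s in zip(w_blocks, x_blocks, scales):
        inner = 0.0
        for wi, xi in zip(w, x):
            inner += bf16(wi) * xi  # FP8 values are exact in bf16
        outer += inner * s
    return outer
```

With one block of four weights equal to 1.5, activations of 1.0 and a block scale of 1.0078125, the exact answer is 6.046875. The two-level version returns exactly that, while the per-element version returns 6.0625 because each scaled weight is rounded up to 1.515625 before the multiply.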

Applied to both kernel variants (`moe_fp8_grouped_gemm` v1 with the
original thread-to-smem mapping, and `moe_fp8_grouped_gemm_v2` with the
coalesced-load thread mapping).

**Measured cosine improvement (Atlas vs HF[BF16-unquant], 10382-token
canonical prompt, dgx1 single-node)**:

| Metric | Before | After | Δ |
|---|---|---|---|
| L20 ssm.moe_out (worst) | 0.91983 | 0.93547 | +0.0156 |
| ssm.moe_out mean (n=30) | 0.97456 | 0.98133 | +0.0068 |
| attn.moe_out mean (n=10) | 0.97316 | 0.97806 | +0.0049 |
| ssm.out_proj mean (n=30) | 0.97813 | 0.98144 | +0.0033 |
| layer.hidden_out mean | 0.98982 | 0.99219 | +0.0024 |

The 1.56pp lift on the worst single op proves the BF16-scale-truncation
was a real precision sink. The remaining ~0.065 gap to perfect cosine at
L20 (1 − 0.93547) will be addressed by stacking complementary patches:

  - Native FP8 MMA on SM121 (per agent A6, confirmed available:
    `mma.sync.m16n8k32.f32.e4m3.e4m3.f32` lowers to `QMMA.16832.F32.E4M3.E4M3`)
    — eliminates the BF16 input-quantization step entirely AND yields
    ~2× tensor-core throughput. ~100-200 LoC kernel rewrite.
  - FP32 residual stream (per agent A8, A1) — replaces `residual_add_rms_norm`
    BF16 output with the existing FP32 variant. Reduces input-side drift
    compounding across 40 layers.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
12-hour autonomous mission Epochs 1-3. The 11-agent research synthesis
identified three layered interventions to enforce non-empty tool-call
parameter values against the Qwen3.6-FP8 multi-turn drift class:
A (grammar minLength), B (sampler byte counter), C (parser holdback).
This commit ships A + B. The 4 grammar attempts converged on raw EBNF
as the only architecturally-correct primitive:

  - `[\s\S]*\S[\s\S]*` (regex sandwich): ε-transitive `\S` in Kleene
    closures, FSM skips required content (A3 finding).
  - `[^ \t\r\n<][^<]*` (regex `+` with structural prefix): inner
    enforcement works but outer `+` quantifier still ε-transitive — let
    `</tool_call>` close with zero `<parameter=>` blocks (B6 finding).
  - `json_schema` with `style: "qwen_xml"` and `minLength: 1` (B6):
    json-schema converter has same ε-edge bug for `[^]{1,}` lowering.
  - **EBNF with `value ::= first_char rest` (B5)**: rule INLINING (per
    llama.cpp's GBNF compiler) means the parent rule is expanded to
    require `first_char` as a single non-quantified terminal — no
    ε-transition can skip it. This is what ships.

EBNF body for qwen3_coder XML wire format:
  root      ::= param ("\n" param)*
  param     ::= "<parameter=" paramname ">" value "</parameter>"
  paramname ::= [a-zA-Z_] [a-zA-Z_0-9]*
  value     ::= first_char rest
  first_char::= [^ \t\r\n<]   (forces ≥1 non-WS non-< byte)
  rest      ::= [^<]*
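
For offline checking of captured bodies, the language this EBNF describes can be approximated with an ordinary regular expression. The sketch below is a test-side helper only, assuming the grammar is exactly the six rules listed above; it says nothing about how xgrammar compiles or masks tokens, and `body_matches` is a hypothetical name.

```python
import re

# param ::= "<parameter=" paramname ">" value "</parameter>"
# value ::= first_char rest, with first_char excluding whitespace and "<"
PARAM = r"<parameter=[a-zA-Z_][a-zA-Z_0-9]*>[^ \t\r\n<][^<]*</parameter>"

# root ::= param ("\n" param)*
BODY_RE = re.compile(rf"{PARAM}(?:\n{PARAM})*")


def body_matches(body):
    """True when the whole body is one or more newline-separated params."""
    return BODY_RE.fullmatch(body) is not None
```

An immediately-closed `<parameter=command></parameter>` is rejected, as is a value that starts with a space or newline. A single backslash as the whole value is accepted, which is consistent with the 1-char garbage values reported for v57.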

Belt-and-suspenders Tier-1 sampler-side byte counter (decode_logits_seq.rs):
new ActiveSeq fields `inside_parameter_body: bool` and
`param_body_chars_emitted: u32` track per-sequence state. When inside
a parameter body AND no content tokens have been emitted yet,
logit_bias appends `(510, -8.0)` (`</` close-tag opener) PLUS
`(220, 198, 197, 256, 271, -8.0)` (common Qwen3 whitespace tokens).
emit_step.rs flag flip detects opener via last-8-token signature
`[27, 15704, 28]` ending in `>` (29); decrements on `</` (510);
does NOT count whitespace tokens toward the chars counter (so the
mask survives a sampled whitespace token).

Critical Tier-1 gate: forced_token_fastpath at decode_logits_seq.rs:307
was bypassing the logit_bias by returning the grammar's sole legal
token directly. When `tier1_active` (inside_parameter_body && chars==0)
the fastpath is skipped so the sampler pipeline actually applies bias.
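
The interaction between the bias and the forced-token fastpath can be sketched as below. The token IDs and the -8.0 bias are the ones quoted above; everything else is a simplification: logits are a plain dict, the "sampler" is a greedy argmax, grammar masking is omitted, and `next_token` is a hypothetical name.

```python
CLOSE_TAG_OPENER = 510                      # "</"
WHITESPACE_TOKENS = (220, 198, 197, 256, 271)
TIER1_BIAS = -8.0


def tier1_active(inside_parameter_body, param_body_chars_emitted):
    return inside_parameter_body and param_body_chars_emitted == 0


def tier1_logit_bias(inside_parameter_body, param_body_chars_emitted):
    """Bias entries to append while the parameter body is still empty."""
    if not tier1_active(inside_parameter_body, param_body_chars_emitted):
        return []
    tokens = (CLOSE_TAG_OPENER,) + WHITESPACE_TOKENS
    return [(tok, TIER1_BIAS) for tok in tokens]


def next_token(logits, forced_token, inside_parameter_body, chars_emitted):
    """Greedy stand-in for the sampler, with the Tier-1 fastpath gate."""
    active = tier1_active(inside_parameter_body, chars_emitted)
    if forced_token is not None and not active:
        return forced_token  # fastpath: skip the sampler pipeline entirely
    biased = dict(logits)
    for tok, bias in tier1_logit_bias(inside_parameter_body, chars_emitted):
        if tok in biased:
            biased[tok] += bias
    return max(biased, key=biased.get)
```

With an empty parameter body, a forced `</` (510) no longer short-circuits the pipeline: the bias is applied and a content token can win. Once at least one content character has been counted, the fastpath behaves as before.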

Also enabled `wants_typed_arguments=true` on Qwen3CoderParser so
PR-73's `coerce_all` runs schema-aware type coercion (string → number)
on bash's `timeout` field — addresses opencode's
`SchemaError(Expected number, got "30")` failure mode independent of
the param-body work.
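
Schema-aware coercion of this kind can be sketched in a few lines. The `coerce_all` name comes from the text; the implementation below is a hypothetical stand-in, assuming a JSON-Schema-style `properties` map with a `type` per field, and it leaves a value untouched whenever conversion fails.

```python
def coerce_value(value, json_type):
    """Convert a string argument to the type its schema declares, if possible."""
    if not isinstance(value, str):
        return value
    try:
        if json_type == "integer":
            return int(value)
        if json_type == "number":
            try:
                return int(value)
            except ValueError:
                return float(value)
        if json_type == "boolean" and value in ("true", "false"):
            return value == "true"
    except ValueError:
        pass  # not convertible: keep the original string
    return value


def coerce_all(args, schema):
    """Apply coerce_value to every argument that has a declared type."""
    props = schema.get("properties", {})
    return {
        key: coerce_value(value, props.get(key, {}).get("type"))
        for key, value in args.items()
    }
```

XML parameter bodies are always text, so a `timeout` of `"30"` arrives as a string; with a schema that declares it a number, the call leaves the parser with `30` instead.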

**v57 measured result** (Epoch 3 EBNF + Tier-1 sampler):
  - 1 file persisted (Cargo.toml) — best non-precision-stack result
  - axum 0.8 dependency cleanly added with json feature
  - Atlas log shows model emitted real axum code:
    `use axum::{routing::get, Json, Router}; async fn pong() -> Json`
  - Remaining gap: model sometimes emits 1-char garbage values
    (single backslash) to satisfy the 1-char minimum; need
    schema-aware min-length OR Tier-2 closer-suffix holdback.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
12-hour autonomous mission Epoch 4 (final): tightened
validate_single_tool_call so WRITE_FAMILY paths must start with
'/', './', or '../' AND be ≥3 chars, and SHELL_FAMILY commands must
be ≥2 chars after trim. Addresses the model's "satisfy minimum with
single garbage char" failure mode that Epoch 3's EBNF rule-inlining
enforcement exposed (model emitted single backslash `\` for filePath
to satisfy the EBNF ≥1 non-WS non-`<` requirement).
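
The tightened rules read roughly like this in code. It is a sketch under assumptions: the members of `WRITE_FAMILY` and `SHELL_FAMILY`, the `filePath` / `command` key names and the error strings are placeholders chosen for the example, not the values in the Atlas validator.

```python
WRITE_FAMILY = {"write", "edit"}  # hypothetical membership
SHELL_FAMILY = {"bash"}           # hypothetical membership


def validate_single_tool_call(name, args):
    """Return None if the call passes the Epoch 4 structural checks."""
    if name in WRITE_FAMILY:
        path = args.get("filePath", "")
        if len(path) < 3 or not path.startswith(("/", "./", "../")):
            return "filePath must be non-empty and start with /, ./ or ../"
    if name in SHELL_FAMILY:
        if len(args.get("command", "").strip()) < 2:
            return "command must be non-empty"
    return None
```

A lone backslash is now rejected, but a well-formed path that has nothing to do with the task still passes, which is the limitation the result below describes.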

Result: v58 test still scored 0 files because the model now drifts
to entirely-wrong paths like `/test-tu-au-u8` that PASS the loose
"starts with /" validator but bear no resemblance to the requested
target. Confirms: no Atlas-side enforcement can teach the model
TASK SEMANTICS — only structural validity. Multi-turn task
faithfulness gap remains the open problem.

Mission summary written to bench/fp8_dgx2_drift/MISSION_PROGRESS.md.
Best result of the mission: Epoch 3 (raw EBNF) v57 — 1 file
persisted (Cargo.toml with valid axum 0.8 dependency), Atlas logs
captured real axum handler code emitted by the model (`use axum::{
routing::get, Json, Router}; async fn pong() -> Json`). EBNF
rule-inlining is the only architecturally-correct grammar primitive
for non-empty enforcement on Atlas's vendored xgrammar (all 3 prior
regex/json_schema attempts blocked by upstream ε-edge bugs).

Recommended next-mission focus: multi-turn task faithfulness
(FlowKV per-turn KV isolation arXiv:2505.15347, preserve_thinking
chat-template fix per A4 community research, native FP8 MMA to
close remaining 0.04 cosine gap on MoE block).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Checkpoint of the Qwen3.6-35B-A3B-FP8 vs vLLM cosine-parity investigation.

Key fixes:
- paged_oproj.rs: W8A8 o_proj called fp8_gemm_t_blockscaled with N and K
  swapped (passed nq*hd as N, h as K). Fix lifted Atlas-vLLM mean residual
  cosine 0.800 -> 0.997 on short prompts.
- quant_helpers.rs: dequant_fp8_blockscaled_to_bf16 now launches a GPU
  kernel (dequant_fp8_blockscaled_bf16.cu) instead of a D2H + CPU loop +
  H2D round-trip. FP8 MoE BF16-dequant load drops ~22min -> ~49s
  (~130x on the dequant step), bit-identical numerics.

Diagnostics (env-gated, inert by default):
- op_dump hooks for attn_out_pre_gate/post_gate, k_post_norm, k_post_rope,
  q_post_rope in qwen3_attention prefill paths.
- BF16 MoE grouped GEMM + dispatch (ATLAS_FP8_DEQUANT_MOE_TO_BF16), plus
  bench/fp8_dgx2_drift cosine + op-cosine harness.

Findings: MoE GEMM precision is NOT the drift source (BF16 MoE == FP8 MoE
at L39 cos 0.952 vs 0.955; NVFP4 MoE worse at 0.868). Remaining 0.955
floor traced to FP8 attention Q/K/V/O projections.
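
For reference, the cosine figures quoted above compare two activation vectors of equal length. The sketch below uses the standard definition; the function name `cosine` is an assumption, and the actual bench/fp8_dgx2_drift harness code is not shown in this PR text.

```rust
/// Cosine similarity between two equal-length activation vectors.
/// Returns None on length mismatch or if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        // Accumulate in f64 so the metric itself adds little rounding noise.
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}
```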

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tbraun96 mentioned this pull request on May 28, 2026
tbraun96 and others added 9 commits May 28, 2026 15:58
…iagnostics

Root-caused the opencode agentic-harness "wandering" (shell-fragment dir
names, collapsed-newline TOML) via a per-decode WS-mask diagnostic — it was
NOT numerical drift but two server-side logit/validation bugs vLLM lacks:

- validation.rs: the write-path validator required filePath to start with
  '/', './' or '../', rejecting bare relative paths like `Cargo.toml` that
  opencode resolves against --dir (and that vLLM accepts). opencode looped
  on the rejection and abandoned the task. Now accepts any non-empty path
  without shell metacharacters/whitespace (still rejects leaked commands
  like `created && ls -R` — CWE-78).

- whitespace_mask.rs: the WS1/WS2 mask suppressed the newline token (198)
  inside tool-param bodies. ATLAS_WS_MASK_DIAG confirmed the model's
  pre-mask top-1 was `\n`, demoted to `;`/` &&`/` ||` — the exact mechanism
  producing shell-fragment paths AND collapsed-newline TOML. Now masks only
  HORIZONTAL whitespace; line breaks are never suppressed. Added
  ATLAS_WS_MASK_OFF=1 to disable the whole mask (matches vLLM, which applies
  no whitespace bias at all).
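
Both relaxed rules can be sketched in a few lines. The function names and the `SHELL_META` character set are assumptions made for illustration; the actual lists in validation.rs and whitespace_mask.rs are not given in the commit text.

```rust
/// Characters treated as shell metacharacters in this sketch (an assumed
/// set, not the one used by validation.rs).
const SHELL_META: &[char] = &['&', '|', ';', '$', '`', '<', '>', '(', ')'];

/// Relaxed write-path rule: any non-empty path is accepted as long as it
/// contains no whitespace and no shell metacharacters.
fn relaxed_write_path_ok(path: &str) -> bool {
    !path.is_empty()
        && !path
            .chars()
            .any(|c| c.is_whitespace() || SHELL_META.contains(&c))
}

/// Horizontal-only whitespace test for a decoded token: true only when the
/// token is non-empty and consists solely of spaces or tabs, so any token
/// containing a line break is never a masking candidate.
fn is_maskable_whitespace(token_text: &str) -> bool {
    !token_text.is_empty() && token_text.chars().all(|c| c == ' ' || c == '\t')
}
```

Under these assumptions a bare `Cargo.toml` is accepted, the leaked command `created && ls -R` is rejected, and a `\n` token is left alone by the mask.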

Diagnostics / kill-switches (env-gated, inert by default):
- ATLAS_WS_MASK_DIAG=1: logs param-body whitespace argmax flips.
- ATLAS_DISABLE_FUZZY_REPAIR=1: no-op the SymSpell tool-arg repair.
- ATLAS_FP8_DEQUANT_ATTN_TO_BF16=1: dequant attention Q/K/V/O to BF16
  (diagnostic, TP=1; confirmed weight precision is not the drift source).
- run_tier.sh --bail: exit on first non-clean run.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Generalizes the WS_MASK_DIAG probe into a full per-decode-step JSONL dump
for Atlas↔vLLM divergence analysis. Each step records raw model top-K
(id,logit) BEFORE Atlas's additive logit-bias stack, the itemized list of
every (id,delta) bias applied (WS mask, attractor, A4 think-suppress, C4
lift — none of which vLLM has), the post-bias argmax, and the sampled
token, sliced by tool-param-body context. Diffing raw_topk against vLLM's
localizes MODEL divergence; the bias list itemizes Atlas-only processor
divergence. Env-gated, inert by default; bias is cloned only when active.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
On an exact full-prompt prefix-cache hit (matched == total), the engine
re-embeds and re-runs the last prompt token through all layers to produce
the first generated token's logits. For SSM/Mamba layers that re-run
applies the last token's recurrent update a SECOND time on top of the
restored snapshot state@N, double-advancing the recurrent state and
corrupting both the first-token logits and the state decode reads.

Proven by byte-compare: fresh-prefill SSM state == warm-restored SSM state
exactly (all layers) — save/restore was never the bug; the re-run was.
Intermediate hits (matched < total) recompute the tail from a checkpoint
and process each token once, so they were already correct.

Fix: the leaf snapshot now also stashes the last token's post-final-norm
hidden (hidden_size BF16). On an exact hit, finalize_last re-restores the
pristine SSM state@N (undoing the redundant re-run's advance) and feeds the
stashed hidden straight to lm_head, skipping any SSM re-run. The redundant
1-token forward is otherwise harmless (its KV write duplicates cached
values). Cold/warm output is now byte-identical and warm hits stay fast.
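
The double-advance can be reproduced with a toy scalar recurrence. Everything below is a simplified stand-in: `ToySsm`, the 0.5 decay, and the function names are assumptions, and the fixed path here simply keeps the snapshot rather than re-restoring it after a redundant forward as the real fix does.

```rust
/// Toy scalar recurrence standing in for an SSM layer: h' = 0.5 * h + x.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ToySsm {
    h: f32,
}

impl ToySsm {
    /// Advance the recurrent state by one token and return the new state.
    fn step(&mut self, x: f32) -> f32 {
        self.h = 0.5 * self.h + x;
        self.h
    }
}

/// Cold prefill: every prompt token is processed exactly once.
/// Returns the final state (the "snapshot") and the last token's output.
fn cold_prefill(prompt: &[f32]) -> (ToySsm, f32) {
    let mut state = ToySsm { h: 0.0 };
    let mut last_out = 0.0;
    for &x in prompt {
        last_out = state.step(x);
    }
    (state, last_out)
}

/// Buggy warm path: restore state@N, then re-run the last prompt token,
/// which applies its recurrent update a second time.
fn warm_hit_buggy(snapshot: ToySsm, last_token: f32) -> (ToySsm, f32) {
    let mut state = snapshot;
    let out = state.step(last_token);
    (state, out)
}

/// Fixed warm path: keep state@N untouched and reuse the stashed output.
fn warm_hit_fixed(snapshot: ToySsm, stashed_out: f32) -> (ToySsm, f32) {
    (snapshot, stashed_out)
}
```

For the prompt `[1.0, 2.0, 4.0]` the cold state is 5.25; the buggy warm path advances it to 6.625, while the fixed path returns the same state and output as the cold run.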

Note: prefill_c.rs prefill_full_cache_hit (two-phase / EP=2) shares the same
pattern and is a known follow-up; not on the single-GPU prefill_b path.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ool-call drift)

Two opencode-quality fixes for Qwen3.6-35B-A3B-FP8, plus diagnostics.

MTP 0% draft acceptance — ROOT CAUSE + FIX
  With ATLAS_FP32_RESIDUAL the main model's hidden stream is FP32, but the
  MTP head's step-2 input rms_norm read the saved hidden as BF16 → NaN →
  all-NaN logits → argmax==0 → every draft was token 0 → verify always
  rejected → 0% acceptance (no speculative speedup). The token embedding
  (always BF16) normed fine; only the FP32 hidden was misread, which is why
  prior weight-quant / vocab / KV-dtype attempts all failed (downstream of
  the NaN). Fix: the MTP hidden-norm now uses the FP32-input rms_norm
  (rms_norm_f32, BF16 out) when use_fp32_residual, so fp32-residual and MTP
  coexist. With the fix MTP accepts ~1.67 drafts/step (~83%) and decode
  rises from ~34 to ~54 tok/s.
  Also: mtp_head now honors BF16 KV (gated by mtp_quantization) instead of
  hard-coding FP8 with unit scales, matching the main model.
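
To see why an FP32 buffer read as BF16 is unsafe, the sketch below reinterprets FP32 bytes as a stream of BF16 values on the host. It is an illustration of the type confusion only, not the MTP kernel; the helper names are assumptions and little-endian layout is assumed.

```rust
/// Widen a BF16 bit pattern to f32 (BF16 is the top 16 bits of an f32).
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Interpret the raw little-endian bytes of an FP32 buffer as BF16 values,
/// which is effectively what a BF16-input kernel does when handed FP32 data.
/// Each f32 yields two bogus "BF16" elements: its low half, then its high half.
fn misread_f32_as_bf16(src: &[f32]) -> Vec<f32> {
    let mut out = Vec::with_capacity(src.len() * 2);
    for value in src {
        let b = value.to_le_bytes();
        out.push(bf16_to_f32(u16::from_le_bytes([b[0], b[1]])));
        out.push(bf16_to_f32(u16::from_le_bytes([b[2], b[3]])));
    }
    out
}
```

The low 16 mantissa bits of a perfectly finite FP32 value can decode to any BF16 pattern, including NaN: `f32::from_bits(0x3F80_7FC0)` is close to 1.0, yet its low half `0x7FC0` is a BF16 NaN, and one NaN is enough to poison an rms_norm reduction.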

BF16 router (ATLAS_BF16_ROUTER) — tool-call drift
  The MoE router/gate was NVFP4 (4-bit); FP8 routing flips drove malformed
  tool calls (empty filePath, mixed <function_calls> tags). Keeping ONLY the
  router in BF16 (a tiny num_experts×h GEMM; experts stay FP8 = full speed)
  eliminates the empty_path drift at no throughput cost.

Diagnostics (env-gated, inert by default): ATLAS_TOOLCALL_DEBUG (post-parse
arg shape on empty-path), ATLAS_MTP_DEBUG_NORMS (staged MTP ‖·‖ to localize
NaN/zero), ATLAS_LMHEAD_BF16 / ATLAS_OFFICIAL_TOOL_PROMPT (A/B levers).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…RECOVERY)

FP8 low-margin token flips corrupt the write tool-call filePath (empty,
truncated dir, or hallucinated like /tmp/pure_axioms.txt) while the CONTENT
is correct — so the Axum server never lands in src/main.rs, cargo builds the
default scaffold (false-positive cargo_valid) and port 3001 never binds.
recover_drifted_write_path() salvages the model's correct content to the
content-inferred in-project path (Cargo.toml / src/main.rs) when the path is
unusable. Env-gated (PCND opt-in); recovers intent, never invents content.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Removes ATLAS_FP32_RESIDUAL + use_fp32_residual() and collapses all ~54 call
sites to the BF16 branch (residual/hidden buffers, rms_norm / residual-add
kernel selection, per-token byte strides). BF16 residual is the shipping
config; the FP32 path was incompatible with MTP (it fed the MTP head FP32
hidden it read as BF16 → NaN → 0% draft acceptance). With FP32-residual gone
the model is always BF16-residual, so MTP works by default. Orphaned FP32 .cu
kernels left in place (unreferenced); the ATLAS_GEMMA4_FP32_LMHEAD scaffolding
is now inert (it required FP32 residual) and can be excised in a follow-up.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Three additive, content-recovering fixes for FP8 tool-call drift on the
opencode rust-axum harness (Qwen3.6-35B-A3B-FP8), all gated behind the
existing ATLAS_WRITE_PATH_RECOVERY flag (PCND opt-in, default-off):

1. Narrate-then-tool fence salvage (tool_salvage): when the model emits
   no native tool_call but renders a whole file in a bare ```rust/```toml
   fence, extract_fenced now infers the write target from the body's
   content shape (`fn main(` -> src/main.rs, `[package]` -> Cargo.toml) and
   synthesises the write(). Plumbs an infer_paths flag from salvage().

2. Write-content markdown fence-unwrap (validation): when the model dumps
   its entire markdown answer into a write() content field (code + stray
   ``` fence + trailing prose / "Principles Applied"), strip the wrapper
   and trailing prose for code files so the .rs/.toml compiles.

3. Path quoted-comma sanitizer (validation): strip a trailing comma/ws so
   a JSON-fragment-shaped filePath like `"/tmp/x/Cargo.toml",` is cleaned
   instead of creating a garbage-named file.

SSOT: extracts classify_path_from_content shared by recover_drifted_write_path
and the fence salvage. Never invents content — only recovers the body/path
the model produced. +8 unit tests (fence contamination, path-comma, salvage
inference); no production behaviour change when the flag is off.

Together these produced Atlas's first end-to-end webserver_ok=true on the
opencode harness. Residual webserver_ok gap is FP8 decode content-completeness
(not serialization) — tracked separately under the BF16-MoE quality work.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… BF16

forward_k2/forward_k3 (the fused multi-token MoE forwards used for MTP
verify and 2/3-sequence batched decode) only had FP8 and NVFP4 branches.
When experts are dequanted FP8->BF16 at load (ATLAS_FP8_DEQUANT_MOE_TO_BF16),
the FP8 source weights are freed, so these forwards read freed memory ->
garbage verify logits -> degenerate repetition under MTP.

Fix: early-return delegate to forward_batched(input, K, ...) when
bf16_gate_weight_ptrs.is_some(). forward_batched already has the tested
per-token BF16 expert path and writes the same moe_output()[K,H]. SSOT:
reuses the decode BF16 kernels, no new fused batch2/batch3 kernel.

Verified: MoE-dequant + MTP is now coherent (was repetition), stable across
a full N=10 (no slot leak, no crash). Inert for the FP8/NVFP4 paths
(bf16_gate_weight_ptrs is None without the dequant flag).

Note: the harness-level BF16-MoE quality lift does not beat the FP8 baseline
(decode-speed cost + partial quality); see memory project_bf16_dequant_lever.
This commit lands the correctness fix regardless.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Diagnostic probe tests (eprintln, no assertions) exploring TOML auto-repair
behaviour on collapsed-one-line manifests with XML content leaks
(`</content>`) and missing-`=` shapes observed in earlier harness runs.
Scratch from a prior debugging session, preserved on request.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>


tbraun96 and others added 10 commits May 31, 2026 15:12
…ostics

Two env-gated, default-off diagnostics for pinpointing async CUDA faults:
- ATLAS_DEBUG_SYNC_KERNELS (kernel_args.rs): synchronize the stream after
  every KernelLaunch::launch so an illegal-address fault surfaces AT the
  culprit kernel (with grid/block + a force-captured backtrace) instead of
  at a later, unrelated sync. Env read once via OnceLock — zero prod cost.
- ATLAS_DEBUG_NO_GRAPH (impl_a1.rs): force eager batched decode (suppress
  CUDA-graph capture) so the per-launch sync above is permitted (calling
  cuStreamSynchronize on a capturing stream fails with CUDA error 900).

Together these pinpointed the BF16 o_proj OOB (next commit).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…_BF16

ms_phase_o_proj (multi-seq batched-decode O projection) had FP8 and NVFP4
branches but NO BF16 branch. Under ATLAS_FP8_DEQUANT_ATTN_TO_BF16 the o_proj
weight is dequanted to BF16 (o_dense_bf16), so o_weight.as_fp8() returns None
and the dispatch fell through to the NVFP4 w4a16_gemv_batch{2,3} path using the
stale FP8/NVFP4 self.attn.o_proj — reading mismatched weight bytes →
CUDA_ERROR_ILLEGAL_ADDRESS (700) in batched decode (n>=2 / MTP).

Fix: add the o_dense_bf16 branch first (per-token dense_gemv), mirroring the
single-seq decode path (attention_forward_oproj.rs). Inert for FP8/NVFP4
(o_dense_bf16 is None without the dequant flag).

Verified: full bit-perfect config (MoE+attn BF16 + MTP) now runs coherent
batched decode with zero CUDA-700 (was: crash). Enabling fix for the BF16
webserver_ok lever (#15).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Layers a freshly-built spark binary onto the prior alpha image (FROM a local
atlas-gb10 tag) to skip the multi-minute CUDA build during iteration — same
pattern as Dockerfile.fence/.fast. Used to build the BF16 dequant / debug-sync
diagnostic images this session. Base tag is local-only; not reproducible on a
fresh clone (matches existing fast-layer Dockerfiles).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New per-layer gate for the FP8→BF16 dequant path: ATLAS_FP8_DEQUANT_LAYERS
takes a comma-separated list of singletons/inclusive ranges (e.g. "31-39" or
"31,35,39") of absolute layer indices. Unset → all layers (legacy behaviour,
no change). Gates both ATLAS_FP8_DEQUANT_MOE_TO_BF16 and _ATTN_ via a single
layer_dequant_selected() helper (parsed once via OnceLock).
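
A minimal sketch of how such a layer spec could be parsed. The names `parse_layer_spec` and `layer_selected` are hypothetical (the real helper is `layer_dequant_selected()`, whose body is not shown), and rejecting the whole spec on any malformed element is an assumed error policy.

```rust
use std::collections::BTreeSet;

/// Parse a spec such as "31-39" or "31,35,39" into a set of layer indices.
/// Returns None if any element is malformed (assumed error policy).
fn parse_layer_spec(spec: &str) -> Option<BTreeSet<usize>> {
    let mut layers = BTreeSet::new();
    for part in spec.split(',') {
        let part = part.trim();
        match part.split_once('-') {
            // "lo-hi" is an inclusive range.
            Some((lo, hi)) => {
                let lo: usize = lo.trim().parse().ok()?;
                let hi: usize = hi.trim().parse().ok()?;
                if lo > hi {
                    return None;
                }
                layers.extend(lo..=hi);
            }
            // A bare number is a single layer.
            None => {
                layers.insert(part.parse().ok()?);
            }
        }
    }
    Some(layers)
}

/// An unset spec selects every layer (the legacy behaviour); a malformed
/// spec selects none in this sketch.
fn layer_selected(spec: Option<&str>, layer: usize) -> bool {
    match spec {
        None => true,
        Some(s) => parse_layer_spec(s).map_or(false, |set| set.contains(&layer)),
    }
}
```

In a real server the parsed set would be computed once and cached, as the commit notes; the sketch re-parses on every call to stay short.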

Motivation: full-BF16 dequant lands the best per-token quality but decode is
~30% slower (2× MoE weight bandwidth → 360s harness timeouts; task #231).
Restricting BF16 to the worst-drift late layers (L31-39) keeps the bulk FP8-
fast: measured 44.7 tok/s (vs full-BF16 33.1, FP8 47.7) — ~94% of FP8 speed —
while reaching cargo_valid 10/10 on the opencode harness (best of any tier;
FP8=8/10, full-BF16=9/10; matches vLLM's 10/10).

webserver_ok stays ~0-1/10 across all quant configs: confirmed NOT a quant-
quality or decode-speed problem — it's the agentic content-completeness ceiling
(model reliably emitting a complete binding server), per the gap forensic.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
FP8 agentic failure mode (forensic gap #9): the model explores (bash ls/cat/
find, read, glob) or narrates across many turns but never writes the
deliverable, so the run ends with a valid Cargo.toml but no real src/main.rs —
webserver_ok never fires. BW1 tallies tool-call productivity across the
conversation (write/edit tools + bash write/build/run verbs = productive;
ls/cat/find/grep/read/glob = exploration) and, when the agent has made many
calls with zero productive output, appends a steering nudge to the most recent
tool response redirecting it to write the file(s) and verify. Escalates to a
CRITICAL nudge past ~9 calls.

Lives in hint_injector (alongside the error-recovery hints); injected at
msg_entry after the conversation scan. Env-gated ATLAS_BASH_WANDER_WATCHDOG=1
(PCND, default-off). +2 unit tests (classifier + threshold/escalation).
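
The tally-and-nudge logic might look roughly like this. It is a sketch under stated assumptions: the `Nudge` enum, the bash verb list, and the `STEER_AFTER` threshold are hypothetical; only the "CRITICAL past ~9 calls" figure comes from the commit message.

```rust
#[derive(Debug, PartialEq)]
enum Nudge {
    Quiet,
    Steer,
    Critical,
}

/// Classify one tool call as productive (writes, builds, runs) or exploration.
fn is_productive(tool: &str, bash_cmd: Option<&str>) -> bool {
    match tool {
        "write" | "edit" => true,
        "bash" => {
            // Look at the first word of the command only (assumed heuristic).
            let verb = bash_cmd
                .and_then(|c| c.split_whitespace().next())
                .unwrap_or("");
            // Assumed list of write/build/run verbs.
            matches!(verb, "cargo" | "mkdir" | "touch" | "tee")
        }
        // read, glob, grep and anything else count as exploration.
        _ => false,
    }
}

/// Decide whether to append a steering nudge, given every (tool, bash command)
/// pair seen so far in the conversation.
fn nudge_for(calls: &[(&str, Option<&str>)]) -> Nudge {
    const STEER_AFTER: usize = 5; // hypothetical threshold
    const CRITICAL_AFTER: usize = 9; // "past ~9 calls" per the commit message
    if calls.iter().any(|(tool, cmd)| is_productive(tool, *cmd)) {
        return Nudge::Quiet;
    }
    match calls.len() {
        n if n > CRITICAL_AFTER => Nudge::Critical,
        n if n >= STEER_AFTER => Nudge::Steer,
        _ => Nudge::Quiet,
    }
}
```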

Targets the only remaining webserver_ok lever after the quant/speed track
(cargo_valid 10/10 via selective L31-39); forensic est +0..+2, partly
model-inherent.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sweeps)

N=10 run JSONs + aggregate reports for the ws_ok lever investigation:
fencecontent/pathrec baselines, bf16moe/bf16both/bf16sel selective-BF16
levers, bw1_sel anti-wander watchdog (confirmed no-op), vLLM 13/13 reference.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ments in Cargo.toml

extract.rs: a bare ``` at end-of-content with no trailing newline made
body_start = content.len()+1, panicking the slice at extract.rs:121 and
crashing the salvage worker (observed live in disable-thinking opencode runs)
— so that request's recovered write never landed. Guard body_start > len.

toml_repair.rs: the FP8 model sometimes annotates Cargo.toml with C-style
`//` comments (TOML uses #), e.g. `name = "x"  // not real TOML`, making
the file unparseable. Add a conservative char-safe stripper (only outside
double-quoted strings, so URLs/# are untouched) as a repair candidate.

Both unit-tested (toml_repair 11/11; extract EOF-fence regression). Found
during the 12h /loop while investigating the Atlas-vs-vLLM webserver_ok gap.
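
A conservative `//` stripper of the kind described for toml_repair.rs could be sketched as below. The function name is hypothetical and the sketch handles only double-quoted basic strings (with backslash escapes); single-quoted literal strings and multi-line strings are out of scope here.

```rust
/// Remove a trailing `//` comment from one line, ignoring any `//` that
/// appears inside a double-quoted string. Returns the line unchanged when
/// no comment is found.
fn strip_slash_comment(line: &str) -> &str {
    let mut in_string = false;
    let mut escaped = false;
    // Byte index of a '/' seen immediately before the current character.
    let mut prev_slash_at: Option<usize> = None;
    for (i, c) in line.char_indices() {
        if in_string {
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            continue;
        }
        match c {
            '"' => {
                in_string = true;
                prev_slash_at = None;
            }
            '/' => {
                if let Some(start) = prev_slash_at {
                    // Two adjacent slashes outside a string: cut here.
                    return line[..start].trim_end();
                }
                prev_slash_at = Some(i);
            }
            _ => prev_slash_at = None,
        }
    }
    line
}
```

With this rule `name = "x"  // not real TOML` becomes `name = "x"`, while `homepage = "https://example.com"` is left untouched because its slashes sit inside a string.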

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lling the agentic verify loop

The spinning detector counted any assistant turn with content<500 AND
tool_args<100 as 'short'; 5 consecutive → suppress_tool_call (-12 bias on
<tool_call> start). But in an agentic coding loop the verify cycle
(bash cargo build / cargo run / curl, read, small edit) is a run of
legitimately short-arg tool calls — so it tripped after ~5 turns and
hard-masked the next tool call, killing the build→error→fix→rebuild loop.
Measured: Atlas capped at ~4-5 turns / 2 bash vs vLLM's 12-17 turns / 8 bash
on the same task (vLLM has no such suppressor → 10/10 webserver_ok).

Fix: a turn that issued ANY non-empty tool call is taking an action
(progress), so it is substantial — spinning now only fires on consecutive
short PURE-TEXT turns. Genuine repeated-tool-call loops remain caught by
loop_detector::detect (the Suppress verdict).
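
The before/after rule can be sketched as two predicates over a per-turn summary. The `Turn` struct and function names are assumptions for illustration; the thresholds (content < 500, tool_args < 100, 5 consecutive turns) are the ones quoted above.

```rust
/// Minimal per-turn summary (hypothetical shape).
struct Turn {
    content_len: usize,
    tool_args_len: usize,
    has_nonempty_tool_call: bool,
}

/// Old rule: short content AND short tool args counts as a "short" turn,
/// even if the turn issued a real tool call.
fn is_short_old(t: &Turn) -> bool {
    t.content_len < 500 && t.tool_args_len < 100
}

/// New rule: a turn that issued any non-empty tool call is substantial;
/// only short pure-text turns count.
fn is_short_new(t: &Turn) -> bool {
    !t.has_nonempty_tool_call && t.content_len < 500
}

/// Spinning fires when the last 5 (or more) turns are all "short".
fn spinning(turns: &[Turn], is_short: fn(&Turn) -> bool) -> bool {
    turns.iter().rev().take_while(|&t| is_short(t)).count() >= 5
}
```

Five consecutive `cargo build` style turns (short args, real tool call) trip the old rule but not the new one, while five short text-only turns still trip both.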

Live result: turns 4→9-15, bash 2→5-7, 'Spinning detection fired' → 0,
builds now succeed, and Atlas landed its FIRST webserver_ok=true on the
opencode harness (real axum server, /ping→pong on 0.0.0.0:3001).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…under-covers a full-prompt match

#226 warm-hit SSM corruption: when an intermediate checkpoint matched at full
prompt length (ssm_snapshot_tokens < matched == total — e.g. the leaf snapshot
was LRU-evicted from the 16-slot pool, leaving only a block-aligned checkpoint),
prefix_lookup.rs forced skip_tokens=matched, restoring SSM state to snap_tok but
advancing positions/KV to total → stale recurrent h_state/conv_state by
(total-snap_tok) tokens → first decoded token reads misaligned state → healthy-looking
but wrong logits → immediate stop / empty completion. Clusters in later N=10 runs
(warm cross-request cache). Fix: skip only to snap_tok so suffix-prefill replays
SSM over the gap (mirrors the matched<total intermediate path); exact-leaf hit
(snap_tok==matched==total) still routes through the 25f8bbe fixup. Same guard
added to the EP=2 two-phase path (prefill_c.rs).

Live N=10 (cache ON): give-ups 4→1, ws_ok 2→3/10. PARTIAL — a residual warm-hit
path remains (cache-OFF still best at 6/10). Correct direction (fix corruption vs
disable caching); residual is a known follow-up.
run_tier.sh: OC_TIMEOUT env knob (default 360) for de-confounding runs.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…pro tooling + bench evidence

prefix_lookup.rs: env-gated (default OFF) bypass of the exact-leaf snapshot
shortcut — routes exact full-prompt hits through full recompute. Used to rule
out the marconi_exact_snap fixup as the cache-ON regression cause (bypass still
2/10). Inert in production.

Adds prefix_partial_hit_repro.py + N=10 run JSONs from the prefix-cache
investigation (spinfix/pfxfix/pfxfix2/exactprobe/cacheoff2/bf16* tiers).

Findings (memory project_gaps_to_10of10): single-forward partial-hit is
bit-exact (verified, identical-prompt cold-vs-warm); harness cache-ON regression
(23% vs cache-OFF 65%, z=3.2) is real but lives in the multi-turn/eviction/batched
regime — next: instrumented multi-turn repro.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tbraun96 changed the title from "fix(streaming): detect & cancel in-think <tool_call> leak (Qwen3.6 + opencode)" to "fix(coherence): Debugging to Get Qwen Working In Agentic Coding" on Jun 1, 2026
…og hardening

First clean Atlas-FP8 webserver_ok (atlas_capfix run2: build+bind+ping=pong, 9 turns,
309s). Snapshot of the fix/in-think-tool-call-leak working tree at the capfix milestone
(the exact state atlas-gb10:capfix was built from).

- F1: unconditional per-generation post-think content-token cap
      (max_post_think_content_tokens, default 100_000 no-op; qwen3.6 MODEL.toml=1536),
      plumbed atlas-kernels -> WatchdogParams; caps in emit_step.rs + decode_logits_content.rs
- F4: sticky tool_request bool so the inter-tool prose-budget survives grammar disengage
- F5: max_inter_tool_prose 2048 -> 1024 (qwen3.6 MODEL.toml only)
- F2: bounded value-rest rest_part{0,6000} behind ATLAS_GRAMMAR_VALUE_HARDEN (default OFF)
- harness: aggregate.py/run_tier.sh exit code = total cargo+webserver failure count

Revert target (pre-session committed base): 7e8e2d6

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tbraun96 and others added 3 commits June 3, 2026 10:20
… serialization parity

ROOT CAUSE of the opencode webserver_ok gap (Atlas-FP8 ~0/10 vs vLLM-FP8 10/10 on
IDENTICAL FP8 weights): repetition_penalty=1.1 in MODEL.toml [sampling.*]. At the
harness temp=0, the frequency penalty (256-token window) suppresses the most-recurrent
code tokens (newline id198, `::` id476) and flips low-margin greedy picks -> collapsed
newlines / `::`-splitting / prose-wander / runaway-to-length. vLLM runs penalty-free
greedy so never flips -- which is why the *more precise* engine was the failing one.
NOT a precision floor: Atlas-vs-BF16 cosine 0.99523 > vLLM 0.98482 at 40/40 layers.
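
The flip mechanism is easy to reproduce on a toy logit vector. The sketch uses the common CTRL-style formulation of repetition penalty (divide positive logits, multiply negative ones, once per distinct recent token); whether Atlas's kernel uses exactly this form is an assumption, and the token ids and logit values are made up.

```rust
/// Apply a CTRL-style repetition penalty to every distinct token id that
/// appears in the recent window.
fn apply_repetition_penalty(logits: &mut [f32], recent: &[usize], penalty: f32) {
    let mut seen = vec![false; logits.len()];
    for &id in recent {
        if id < logits.len() && !seen[id] {
            seen[id] = true;
            let l = logits[id];
            logits[id] = if l > 0.0 { l / penalty } else { l * penalty };
        }
    }
}

/// Greedy pick: index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &l) in logits.iter().enumerate() {
        if l > logits[best] {
            best = i;
        }
    }
    best
}
```

With logits `[5.0, 4.8, 1.0]`, where id 0 plays the role of a frequently recurring newline token, a penalty of 1.1 lowers 5.0 to about 4.55 and the greedy pick flips to id 1; with a penalty of 1.0 the logits are unchanged and id 0 stays on top.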

- kernels/gb10/qwen3.6-35b-a3b/MODEL.toml: repetition_penalty 1.1->1.0 in all four
  [sampling.*] presets; dry_multiplier=0.5 KEPT (DRY catches the verbatim filename
  runaway that prompted the prior revert; rep_penalty is the code-corruptor). Note the
  selector (sampling_setup.rs:53) uses tools|thinking_text|non_thinking; thinking_coding
  is currently dead but set to 1.0 too.
- tool-JSON: serde_json + minijinja `preserve_order` + a custom spaced `tojson` filter so
  the <tools> block byte-matches transformers/vLLM (Cargo.toml, spark-server/Cargo.toml,
  tokenizer.rs, jinja_helpers.rs).

Result: norep2 N=10 = 8/10 webserver_ok, 9/10 cargo_valid (was 0/39 across all configs
this session). Also re-enables the fast GPU-greedy decode path (gated off when
repetition_penalty != 1.0).

Also carries prior env-gated additions (default-off, inert unless enabled):
ATLAS_LOOP_NO_SUPPRESS (loop_detect.rs), ATLAS_TOOL_SHORT_TRIGGER (compile_tools.rs),
qwen3_coder leak-markers (qwen3_coder.rs).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ids, unify logit paths, fix >=-merge grammar boundary

Drives Qwen3.6-35B-A3B-FP8 opencode webserver_ok from 8/10 to a clean 10/10
with MTP on (BF16-KV, N=10), and proves the long-standing gap was never an
"FP8 precision floor" but a stack of Atlas-specific band-aids plus a
tokenizer grammar-boundary bug.

Grammar (compile_tools.rs):
- value ::= leading_ws* first_content rest, so a leading newline (the
  model's genuine top-1 at content-start) is no longer masked into a
  lean/cargo drift runner-up; keeps the non-empty guard.
- Exclude =/> from first_content. The param key closes with '>' and the
  tokenizer fuses it into >X merge tokens (>= is id 9628); at the
  <parameter=KEY>->value boundary the model could emit >=, depositing a
  phantom '=' as the value's first char and breaking edit oldString matches
  (the "give-up": 3 failed edits -> watchdog kill of a coherent agent).
  Excluding =/> makes xgrammar reject the merge token at the boundary.

Remove Atlas-only band-aids (no analogue in vLLM/SGLang/TRT-LLM):
- AM1 lean-attractor logit bias + pos-0 close-tag bias + tier1 fast-path
  bypass + the attractor_mask module.
- Whole tool-call repair/salvage stack: toml_repair (SC1), fuzzy_repair
  (A2/SymSpell), tool_retry (Tier5c), tool_salvage (narrate->tool),
  repair_json, and the validation.rs ATLAS_WRITE_PATH_RECOVERY block.
- Whitespace mask (WS1/WS2).
- Dead env vars: ATLAS_DISABLE_FUZZY_REPAIR, ATLAS_WRITE_PATH_RECOVERY,
  ATLAS_TOOL_RETRY.

Unify the two per-token logit paths (SSOT): the non-MTP decode path was a
stale inline copy of run_pipeline; both now route through
process_position_logits. Fixes A4 being silently dead on the MTP verify
path; B1 margin detector moved to b1_margin.rs (FinalDecode-gated). Three
guard tests make the "logit rule on one path only" regression impossible.

MODEL.toml: dry_multiplier 0.5->0.0 (unlocks the on-GPU fast-greedy path,
~30->48 tok/s on BF16-KV; rep_penalty already 1.0).

SSM slot-leak-on-abort fix (RAII SlotGuard, ssm_pool.rs) so no-MTP runs do
not brick.
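
The RAII pattern behind such a guard can be sketched as follows. `SlotPool`, `run_request`, and the single-threaded `RefCell` free list are assumptions for illustration; the real ssm_pool.rs is not shown and presumably needs thread-safe synchronisation.

```rust
use std::cell::RefCell;

/// Toy single-threaded slot pool holding the indices of free slots.
struct SlotPool {
    free: RefCell<Vec<usize>>,
}

impl SlotPool {
    fn new(n: usize) -> Self {
        SlotPool { free: RefCell::new((0..n).collect()) }
    }

    /// Take a slot, wrapped in a guard that returns it when dropped.
    fn acquire(&self) -> Option<SlotGuard<'_>> {
        let slot = self.free.borrow_mut().pop()?;
        Some(SlotGuard { pool: self, slot })
    }

    fn available(&self) -> usize {
        self.free.borrow().len()
    }
}

/// Owns one slot for the duration of a request.
struct SlotGuard<'a> {
    pool: &'a SlotPool,
    slot: usize,
}

impl Drop for SlotGuard<'_> {
    /// Runs on every exit path, including early returns and unwinding.
    fn drop(&mut self) {
        self.pool.free.borrow_mut().push(self.slot);
    }
}

/// A request that may abort early; the slot is released either way.
fn run_request(pool: &SlotPool, abort: bool) -> Result<usize, &'static str> {
    let guard = pool.acquire().ok_or("no free slot")?;
    if abort {
        return Err("aborted");
    }
    Ok(guard.slot)
}
```

Because the release lives in `Drop`, an aborted request cannot leave a slot checked out, which is the failure the commit describes as the pool bricking.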

Result: MTP-on BF16-KV N=10 = 10/10 webserver_ok, 0 phantom-= corruptions,
all drift markers 0, 49 tok/s. spark-server 459 tests + spark-model 69 green.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
…lit-K negative result

ATLAS_DECODE_TIMING=1 (gated, zero-cost off): per-token host-path stage
timers in process_decode_logits splitting copy+forward-wait vs the
248k-vocab host sampling loops. Measured: host sampling is ~0.68ms/tok
(negligible); the per-token decode cost is the GPU forward.

split_ref_seqs: reverted a single-stream split-K unpin experiment — a
clean A/B (same 21.8k code task) was byte-identical (12.7 tok/s both),
confirming attention occupancy is NOT the long-ctx decode bottleneck.
Determinism pin kept intact; negative result documented in-code.

These localize the real bottleneck (separate kernel work): prefill is
~20x slower than vLLM, owned by the naive W8A16 projection GEMM (38%) +
MoE FFN GEMM (40%) at 1.4-7 TFLOP/s, NOT the GDN scan (11%).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>